Comparative variation analysis

Complete code supplement

Authors

Jason Grafmiller
Benedikt Szmrecsanyi

Published

July 14, 2023

Introduction

This document contains the complete R code for the analyses presented in Szmrecsanyi & Grafmiller (2023) Comparative variation analysis: Grammatical alternations in World Englishes, Cambridge University Press. All code for the plots and analyses in the book can be found here. Some code is folded for readability, but you can see the code by clicking the “Code” buttons above the tables or figures, or clicking “Show All Code” in the </> code menu in the upper right. The full Quarto source code for this document can be found under “View Source”.

This document follows the chapter structure in the book, reconstructing all the quantitative analyses in Chapters 4-7. For readability, supplementary code for some parts has been moved to an Appendix (Chapter 6). All datasets and other files are available in the accompanying Open Science Framework repository: https://osf.io/5hvtw/.

R setup

Libraries

Load the libraries.

pkgs <- c(
  "here", # for file path management
  "knitr", # for knitting rmarkdown
  "kableExtra", # for table styling
  "gmodels", # for crosstables
  "hypr", # for calculating custom factor contrast codings
  "lme4", # for mixed-effects models
  "lmerTest", # significance tests for GLMMs
  "optimx", # lme4 optimizer
  "MuMIn", # for r.squaredGLMM
  "car", # for recoding
  "caret", # for evaluating model
  "performance", # for evaluating model
  "parameters", # for summarizing model
  "ggeffects", # get partial effects from models
  "Hmisc", # for `somers2()` function

  "party", # for random forests
  "permimp", # for faster varimp for {party}
  "phangorn", # for neighborNets

  "analogue", # for fusing distance matrices
  "vegan", # for Mantel tests of distance matrix correlations

  "remotes", # for installing from GitHub
  "tidyverse", # for data wrangling and plotting: dplyr, ggplot2, etc.
  "ggdendro", # for working with dendrograms in ggplot2
  "patchwork", # for combining ggplots
  "scales", # for adding scales to ggplots
  "ggrepel", # for nicely annotating points with text
  "extrafont" # for fonts in figures
)

Install packages (if not already installed) and load them.

install.packages(pkgs[!pkgs %in% installed.packages()[, 1]])
invisible(lapply(pkgs, FUN = function(x) suppressMessages(library(x, character.only = TRUE))))
rm(pkgs) # remove list

Install and load the {VADIS} package.

# remotes::install_github("jasongraf1/VADIS")
library(VADIS)

Custom functions

The file custom_functions.R contains helper functions used for data wrangling, summary tables, and model diagnostics.

source("custom_functions.R")
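The definitions in custom_functions.R are not reproduced in this document. As a rough guide, the two helpers used most often below might look something like the following sketch — these are assumed reconstructions for illustration, not the authors' actual code:

```r
# Hypothetical reconstructions of two helpers from custom_functions.R.
# The real definitions may differ; these only illustrate the intended behavior.

# Collapse factor levels with fewer than `n` tokens into a single
# "OTHER" level (used to prune sparse lexical/text random effects).
filter_infrequent <- function(x, n = 10) {
  x <- as.character(x)
  counts <- table(x)
  x[x %in% names(counts)[counts < n]] <- "OTHER"
  factor(x)
}

# Turn a 2-column frequency table into counts with row percentages
# and totals, matching the layout of the summary tables below.
make_prop_table <- function(tab) {
  pct <- round(prop.table(tab, margin = 1) * 100, 1)
  out <- cbind(tab[, 1], pct[, 1], tab[, 2], pct[, 2], rowSums(tab))
  colnames(out) <- c(colnames(tab)[1], "%", colnames(tab)[2], "%", "Total")
  rbind(out, Total = c(sum(tab[, 1]), NA, sum(tab[, 2]), NA, sum(tab)))
}
```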

Plot styling

Set style for figures.

extrafont::loadfonts(device = "win") # "win" is Windows-specific; adjust or omit on other platforms
text_cols <- c("black", "white")

theme_cup <- function() {
  # Set the theme
  theme_classic() %+replace%
    theme(
      strip.text.x = element_text(
        size = 10, color = "black",
        hjust = 0, face = "bold", family = "serif"
      ),
      strip.background = element_blank(),
      axis.text.x = element_text(color = "black", size = 12),
      axis.text.y = element_text(color = "black", size = 12),
      axis.title = element_text(size = rel(1.3)),
      plot.title = element_text(
        size = rel(1.4), color = "black", hjust = 0,
        face = "bold",
        margin = margin(b = 10)
      ),
      axis.line = element_line(color = "black"),
      panel.grid.minor = element_blank(),
      text = element_text(family = "serif")
    )
}

Datasets

Load in the annotated datasets from the corpora and preprocess them. Descriptions of these datasets are found in Chapter 4 of the book, and further details are provided in the annotation manuals contained in the OSF repository: https://osf.io/5hvtw/.

For each dataset we collapse infrequent levels of the lexical-item and text (file) factors, which will be included as random effects in the regression models.

Set the order of the variety labels.

var_levs <- c("BrE", "CanE", "IrE", "NZE", "HKE", "IndE", "JamE", "PhlE", "SgE")

Genitives

data_genitives <- readRDS(here("data", "data_gentives_CUP.rds")) |>
  as.data.frame()

data_genitives$PorHeadLemma_pruned <- filter_infrequent(data_genitives$POR_HEAD_LEMMA, 20)
data_genitives$PumHeadLemma_pruned <- filter_infrequent(data_genitives$PUM_HEAD_LEMMA, 20)
data_genitives$FileID_pruned <- filter_infrequent(data_genitives$FILE_ID, 20) # file ID column name assumed (cf. FileID in the other datasets)
data_genitives$variety_of_E <- factor(data_genitives$variety_of_E, levels = var_levs)

Datives

data_datives <- here("data", "data_datives_CUP.rds") |>
  readRDS()

data_datives$Verb_pruned <- filter_infrequent(data_datives$Verb, 20)
data_datives$RecHeadPlain_pruned <- filter_infrequent(data_datives$RecHeadPlain, 20)
data_datives$ThemeHeadPlain_pruned <- filter_infrequent(data_datives$ThemeHeadPlain, 20)
data_datives$FileID_pruned <- filter_infrequent(data_datives$FileID, 20) # less than 20 and the glmer model won't converge
data_datives$variety_of_E <- factor(data_datives$variety_of_E, levels = var_levs)

Particle verbs

data_particle_verbs <- here("data", "data_particle_verbs_CUP.rds") |>
  readRDS()

data_particle_verbs$VerbPart_pruned <- filter_infrequent(data_particle_verbs$VerbPart, 20)
data_particle_verbs$Verb_pruned <- filter_infrequent(data_particle_verbs$Verb, 20)
data_particle_verbs$FileID_pruned <- filter_infrequent(data_particle_verbs$FileID, 20)
data_particle_verbs$variety_of_E <- factor(data_particle_verbs$variety_of_E, levels = var_levs)

Chapter 4

This chapter introduces the corpora used, and describes the data collection and annotation procedures.

Table 4.2 shows the size of the GloWbE corpus.

Code
data_glowbe_corpus <- read.csv(here("data", "glowbe_corpus_size.csv")) |>
  rename(Country = "X")
data_glowbe_corpus |>
  kable() |>
  kable_styling()
Table 2.1: Design of the GloWbE corpus (from Davies and Fuchs 2015, 6).
Country         Code   Websites   Webpages        Words
United States   US        82260     275156    386809355
Canada          CA        33776     135692    134765381
Great Britain   GB        64351     381841    387615074
Ireland         IE        15840     102147    101029231
Australia       AU        28881     129244    148208169
New Zealand     NZ        14053      82679     81390476
India           IN        18618     113765     96430888
Sri Lanka       LK         4208      38389     46583115
Pakistan        PK         4955      42769     51367152
Bangladesh      BD         5712      45059     39658255
Singapore       SG         8339      45459     42974705
Malaysia        MY         8966      45601     42420168
Philippines     PH        10224      46342     43250093
Hong Kong       HK         8740      43936     40450291
South Africa    ZA        10308      45264     45364498
Nigeria         NG         4516      37285     42646098
Ghana           GH         3616      47351     38768231
Kenya           KE         5193      45962     41069085
Tanzania        TZ         4575      41356     35169042
Jamaica         JM         3488      46748     39663666
Total                    340619    1792045   1885632973

Tables 4.3–4.5 show the distribution of alternation variants by corpus and variety.

Code
gens_df_split <- split(data_genitives, data_genitives$Corpus)
ice_summary <- table(
  gens_df_split$ICE$variety_of_E,
  gens_df_split$ICE$Response
) |>
  make_prop_table()
glowbe_summary <- table(
  gens_df_split$GloWbE$variety_of_E,
  gens_df_split$GloWbE$Response
) |>
  make_prop_table()
cbind(ice_summary, glowbe_summary) |>
  kable() |>
  add_header_above(c(" ", "ICE" = 5, "GloWbE" = 5), bold = T) |>
  kable_styling()
Table 2.2: Summary of genitive variants in ICE and GloWbE corpus data
                     ICE                             GloWbE
         of     %      s     %  Total      of     %      s     %  Total
BrE     787  74.0    276  26.0   1063     388  70.7    161  29.3    549
CanE    631  63.2    368  36.8    999     207  57.3    154  42.7    361
IrE     629  70.8    259  29.2    888     294  69.2    131  30.8    425
NZE     770  70.6    321  29.4   1091     525  67.7    251  32.3    776
HKE     719  69.8    311  30.2   1030     320  64.1    179  35.9    499
IndE    904  79.5    233  20.5   1137     477  68.2    222  31.8    699
JamE    744  81.0    174  19.0    918     265  68.1    124  31.9    389
PhlE    898  74.8    302  25.2   1200     360  65.6    189  34.4    549
SgE     609  67.1    299  32.9    908     195  61.5    122  38.5    317
Total  6691    NA   2543    NA   9234    3031    NA   1533    NA   4564
Code
dats_df_split <- split(data_datives, data_datives$Corpus)
ice_summary <- table(
  dats_df_split$ice$variety_of_E,
  dats_df_split$ice$Response
) |>
  make_prop_table()
glowbe_summary <- table(
  dats_df_split$glowbe$variety_of_E,
  dats_df_split$glowbe$Response
) |>
  make_prop_table()
cbind(ice_summary, glowbe_summary) |>
  kable() |>
  add_header_above(c(" ", "ICE" = 5, "GloWbE" = 5), bold = T) |>
  kable_styling()
Table 2.3: Summary of dative variants in ICE and GloWbE corpus data
                     ICE                             GloWbE
         do     %     pd     %  Total      do     %     pd     %  Total
BrE     640  73.2    234  26.8    874     292  65.8    152  34.2    444
CanE    671  72.9    250  27.1    921     297  68.1    139  31.9    436
IrE     645  74.4    222  25.6    867     279  69.2    124  30.8    403
NZE     736  71.3    296  28.7   1032     320  70.6    133  29.4    453
HKE     841  66.0    433  34.0   1274     288  58.3    206  41.7    494
IndE    608  56.3    471  43.7   1079     304  64.5    167  35.5    471
JamE    683  72.4    260  27.6    943     294  68.1    138  31.9    432
PhlE    661  65.5    348  34.5   1009     334  67.3    162  32.7    496
SgE     772  72.9    287  27.1   1059     322  66.5    162  33.5    484
Total  6257    NA   2801    NA   9058    2730    NA   1383    NA   4113
Code
pv_df_split <- split(data_particle_verbs, data_particle_verbs$Corpus)
ice_summary <- table(
  pv_df_split$ice$variety_of_E,
  pv_df_split$ice$Response
) |>
  make_prop_table()
glowbe_summary <- table(
  pv_df_split$glowbe$variety_of_E,
  pv_df_split$glowbe$Response
) |>
  make_prop_table()
cbind(ice_summary, glowbe_summary) |>
  kable() |>
  add_header_above(c(" ", "ICE" = 5, "GloWbE" = 5), bold = T) |>
  kable_styling()
Table 2.4: Summary of particle placement variants in ICE and GloWbE corpus data
                     ICE                             GloWbE
         do     %     pd     %  Total      do     %     pd     %  Total
BrE     640  73.2    234  26.8    874     292  65.8    152  34.2    444
CanE    671  72.9    250  27.1    921     297  68.1    139  31.9    436
IrE     645  74.4    222  25.6    867     279  69.2    124  30.8    403
NZE     736  71.3    296  28.7   1032     320  70.6    133  29.4    453
HKE     841  66.0    433  34.0   1274     288  58.3    206  41.7    494
IndE    608  56.3    471  43.7   1079     304  64.5    167  35.5    471
JamE    683  72.4    260  27.6    943     294  68.1    138  31.9    432
PhlE    661  65.5    348  34.5   1009     334  67.3    162  32.7    496
SgE     772  72.9    287  27.1   1059     322  66.5    162  33.5    484
Total  6257    NA   2801    NA   9058    2730    NA   1383    NA   4113

Chapter 5

Chapter 5 focuses on the analysis of individual alternations. For each alternation we fit a single random forest model to the entire dataset and assess variable importance; we then examine variable importance for each variety individually, based on separate by-variety forests. Finally, we fit a mixed-effects regression model to the entire dataset and examine significant interactions between internal constraints and variety.

5.2 Genitive alternation

Cross-tabulate genitive variant and variety of English.

table(data_genitives$variety_of_E, data_genitives$Response) |>
  make_prop_table() |>
  kable() |>
  kable_styling()
Table 3.1: Variant rates of genitive constructions across varieties of English. Upper half of the table: Inner Circle varieties of English; lower half of the table: Outer Circle varieties of English.
          of     %      s     %  Total
BrE     1175  72.9    437  27.1   1612
CanE     838  61.6    522  38.4   1360
IrE      923  70.3    390  29.7   1313
NZE     1295  69.4    572  30.6   1867
HKE     1039  68.0    490  32.0   1529
IndE    1381  75.2    455  24.8   1836
JamE    1009  77.2    298  22.8   1307
PhlE    1258  71.9    491  28.1   1749
SgE      804  65.6    421  34.4   1225
Total   9722    NA   4076    NA  13798

Variable importance in global random forest model

Now fit a conditional random forest to the full genitives dataset.

# formula
gen_f1 <- Response ~
  PorAnimacyBin +
  PorLength +
  PumLength +
  PorNPexprTypeBin +
  PorFinalSibilancy +
  PreviousChoice +
  SemanticRelationBin +
  PorHeadFreq

gen_f2 <- update(gen_f1, . ~ . + variety_of_E + Circle + GenreCoarse)
gen_f3 <- update(gen_f1, . ~ . + GenreCoarse)

Run the forest and save the output. {party} models are large and slow to fit, so we save the output for reloading. Note that model outputs are not currently stored on OSF.

if (file.exists(here("model_output", "gen_forest1.rds"))) {
  # load file if already run
  gen_forest1 <- readRDS(here("model_output", "gen_forest1.rds"))
} else {
  # if file doesn't already exist, run the analysis and save to file
  gen_forest1 <- party::cforest(gen_f2, data_genitives)
  saveRDS(gen_forest1, here("model_output", "gen_forest1.rds"))
}

Get predictions, again saving output for quicker loading later.

if (file.exists(here("model_output", "gen_forest1_preds.rds"))) {
  gen_forest1_preds <- readRDS(here("model_output", "gen_forest1_preds.rds"))
} else {
  # treeresponse() returns one (of, s) probability pair per observation;
  # [c(FALSE, TRUE)] keeps every second element, i.e. P(s-genitive)
  gen_forest1_preds <- unlist(party::treeresponse(gen_forest1))[c(FALSE, TRUE)]
  saveRDS(gen_forest1_preds, here("model_output", "gen_forest1_preds.rds"))
}

Check predictive performance.

data_genitives$fitted_crf1 <- gen_forest1_preds
data_genitives$predicted_crf1 <- as.factor(ifelse(data_genitives$fitted_crf1 >= .5, "s", "of"))
caret::confusionMatrix(data_genitives$predicted_crf1, data_genitives$Response,
  mode = "prec_recall"
)
Confusion Matrix and Statistics

          Reference
Prediction   of    s
        of 9126 1150
        s   596 2926
                                         
               Accuracy : 0.8735         
                 95% CI : (0.8678, 0.879)
    No Information Rate : 0.7046         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.6835         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
              Precision : 0.8881         
                 Recall : 0.9387         
                     F1 : 0.9127         
             Prevalence : 0.7046         
         Detection Rate : 0.6614         
   Detection Prevalence : 0.7447         
      Balanced Accuracy : 0.8283         
                                         
       'Positive' Class : of             
                                         

Check the area under the ROC curve (AUC). This is alternatively known as the concordance index C (Harrell 2001: 105, 257).

data.frame(
  obs = data_genitives$Response,
  pred = data_genitives$predicted_crf1,
  of = 1 - data_genitives$fitted_crf1,
  s = data_genitives$fitted_crf1
) |>
  caret::twoClassSummary(lev = levels(data_genitives$Response))
      ROC      Sens      Spec 
0.9433525 0.9386957 0.7178606 
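The concordance interpretation of the AUC can be made concrete with a small base-R sketch on simulated data (the object names here are made up for illustration). C is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case, with ties counting half; `Hmisc::somers2()`, which {Hmisc} is loaded for above, should return the same value as its C component.

```r
# Concordance index C computed from its pairwise definition, on simulated data
set.seed(42)
p <- runif(300)              # simulated predicted probabilities of the "positive" variant
y <- rbinom(300, 1, p)       # simulated observed outcomes (1 = positive)

pos <- p[y == 1]
neg <- p[y == 0]
# proportion of (positive, negative) pairs ranked correctly; ties count 0.5
cmp <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
C <- mean(cmp)
C
# Hmisc::somers2(p, y)["C"] should agree with this value
```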

Get (unconditional) variable importance scores, saving outputs again.

if (!file.exists(here("model_output", "gen_forest1_varimp.rds"))) {
  gen_forest1_varimp <- permimp(gen_forest1, conditional = FALSE, progressBar = FALSE, asParty = TRUE)
  saveRDS(gen_forest1_varimp, here("model_output", "gen_forest1_varimp.rds"))
} else {
  gen_forest1_varimp <- here("model_output", "gen_forest1_varimp.rds") |>
    readRDS()
}
invisible(gc(verbose = FALSE)) # free up unused memory (party RFs are big)

Create Figure 5.1, plotting the variable importances for the entire genitives dataset.

data.frame(Varimp = gen_forest1_varimp$values) |>
  # rename(Varimp = "gen_forest1_varimp$values") |>
  rownames_to_column("Variable") |>
  ggplot(aes(reorder(Variable, Varimp), Varimp)) +
  geom_hline(yintercept = 0) +
  geom_segment(aes(xend = Variable, yend = 0), col = "black") +
  geom_point(col = "black") +
  coord_flip() +
  theme_cup() +
  labs(x = "", y = "Variable Importance")

Figure 3.1: CRF permutation variable importance ranking for the pooled genitive dataset. C = 0.94.

Variable importance in by-variety random forest models

Next we fit RFs to each variety individually and calculate the variable importance. First split the data.

data_genitives_split <- split(data_genitives, data_genitives$variety_of_E)

Now fit random forest for each variety.

if (file.exists(here("model_output", "gen_crf_list.rds"))) {
  genitives_variety_forests <- readRDS(
    here("model_output", "gen_crf_list.rds")
  )
} else {
  genitives_variety_forests <- lapply(
    data_genitives_split, function(x) fit_party_crf(gen_f3, x)
  )
  saveRDS(genitives_variety_forests, here("model_output", "gen_crf_list.rds"))
}

Now get the (unconditional) variable importance scores.

if (file.exists(here("model_output", "gen_crf_varimps.rds"))) {
  genitives_variety_varimps <- readRDS(
    here("model_output", "gen_crf_varimps.rds")
  )
} else {
  genitives_variety_varimps <- lapply(
    genitives_variety_forests, function(x) get_party_varimp(x)$values
  )
  saveRDS(genitives_variety_varimps, here("model_output", "gen_crf_varimps.rds"))
}

Create Figure 5.2 showing by-variety variable importance.

Code
preds <- genitives_variety_varimps[[1]] |>
  sort() |>
  names()

lapply(
  seq_along(genitives_variety_varimps),
  function(i) {
    x <- genitives_variety_varimps[[i]]
    x |>
      as.data.frame() |>
      rename(Varimp = 1) |>
      rownames_to_column("Variable") |>
      mutate(Variable = factor(Variable, levels = preds)) |>
      # bind_rows() |>
      # mutate(
      #   Variety = rep(names(genitives_variety_varimps), each = 9)
      # ) |>
      ggplot(aes(Variable, Varimp)) +
      geom_hline(yintercept = 0) +
      geom_segment(aes(xend = Variable, yend = 0), col = "black") +
      geom_point(col = "black") +
      coord_flip() +
      # facet_wrap(~Variety, ncol = 3) +
      theme_cup() +
      labs(x = "", y = "", title = names(genitives_variety_varimps)[i])
  }
) |>
  wrap_plots(ncol = 3) +
  plot_annotation(caption = "Click to enlarge")

Figure 3.2: Conditional Random Forest permutation variable importance ranking of constraints on the genitive alternation by variety of English.

Variety interactions in mixed-effects regression model

Next fit a generalized mixed-effects regression model to the genitives dataset.

Set reference levels.

data_genitives$Response <- relevel(data_genitives$Response, ref = "of")
data_genitives$PorAnimacyBin <- relevel(data_genitives$PorAnimacyBin, ref = "inanimate")
data_genitives$PorNPexprTypeBin <- relevel(data_genitives$PorNPexprTypeBin, ref = "other")
data_genitives$PorFinalSibilancy <- relevel(data_genitives$PorFinalSibilancy, ref = "final_sibilant_absent")

Formula for genitives.

f_reg_gens <- Response ~
  PorAnimacyBin +
  PorLength +
  PumLength +
  PorNPexprTypeBin +
  PorFinalSibilancy +
  variety_of_E +

  PorAnimacyBin:variety_of_E +
  PorFinalSibilancy:variety_of_E +

  (1 | FileID_pruned) +
  (1 | PumHeadLemma_pruned)

Set the control parameters for glmer().

glmer_ctrl <- glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e7))

Fit mixed-effects logistic regression model:

if (!file.exists(here("model_output", "gens_glmer1.rds"))) {
  glmer_genitives1 <- glmer(f_reg_gens,
    data = data_genitives,
    family = binomial, control = glmer_ctrl
  )
  saveRDS(glmer_genitives1, here("model_output", "gens_glmer1.rds"))
} else {
  glmer_genitives1 <- readRDS(here("model_output", "gens_glmer1.rds"))
}

Check the model fit.

performance::model_performance(glmer_genitives1)
# Indices of model performance

AIC       |      AICc |       BIC | R2 (cond.) | R2 (marg.) |   ICC |  RMSE | Sigma | Log_loss | Score_log | Score_spherical
----------------------------------------------------------------------------------------------------------------------------
10403.316 | 10403.469 | 10644.349 |      0.738 |      0.629 | 0.293 | 0.340 | 1.000 |    0.363 |      -Inf |           0.001

Precision and recall.

data_genitives$fitted_glmer1 <- fitted(glmer_genitives1)
data_genitives$predicted_glmer1 <- as.factor(ifelse(data_genitives$fitted_glmer1 >= .5, "s", "of"))
caret::confusionMatrix(data_genitives$predicted_glmer1, data_genitives$Response,
  mode = "prec_recall"
)
Confusion Matrix and Statistics

          Reference
Prediction   of    s
        of 8903 1427
        s   819 2649
                                         
               Accuracy : 0.8372         
                 95% CI : (0.831, 0.8433)
    No Information Rate : 0.7046         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.5913         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
              Precision : 0.8619         
                 Recall : 0.9158         
                     F1 : 0.8880         
             Prevalence : 0.7046         
         Detection Rate : 0.6452         
   Detection Prevalence : 0.7487         
      Balanced Accuracy : 0.7828         
                                         
       'Positive' Class : of             
                                         

Area under ROC curve.

data.frame(
  obs = data_genitives$Response,
  pred = data_genitives$predicted_glmer1,
  of = 1 - data_genitives$fitted_glmer1,
  s = data_genitives$fitted_glmer1
) |>
  caret::twoClassSummary(lev = levels(data_genitives$Response))
      ROC      Sens      Spec 
0.8956718 0.9157581 0.6499019 

Check for multicollinearity. Lower condition numbers (κ) indicate less collinearity.

kappa_mer(glmer_genitives1)
[1] 22.89032
performance::check_collinearity(glmer_genitives1)
# Check for Multicollinearity

Low Correlation

             Term  VIF       VIF 95% CI Increased SE Tolerance Tolerance 95% CI
        PorLength 1.07 [  1.05,   1.09]         1.04      0.93     [0.91, 0.95]
        PumLength 1.06 [  1.04,   1.08]         1.03      0.95     [0.93, 0.96]
 PorNPexprTypeBin 1.07 [  1.05,   1.09]         1.03      0.93     [0.92, 0.95]

Moderate Correlation

              Term  VIF       VIF 95% CI Increased SE Tolerance
     PorAnimacyBin 8.45 [  8.19,   8.72]         2.91      0.12
 PorFinalSibilancy 7.69 [  7.45,   7.93]         2.77      0.13
 Tolerance 95% CI
     [0.11, 0.12]
     [0.13, 0.13]

High Correlation

                           Term    VIF       VIF 95% CI Increased SE Tolerance
                   variety_of_E  44.44 [ 43.01,  45.93]         6.67      0.02
     PorAnimacyBin:variety_of_E 223.87 [216.56, 231.44]        14.96  4.47e-03
 PorFinalSibilancy:variety_of_E  29.37 [ 28.43,  30.35]         5.42      0.03
 Tolerance 95% CI
     [0.02, 0.02]
     [0.00, 0.00]
     [0.03, 0.04]
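The kappa_mer() function above is defined in custom_functions.R and is not reproduced here. A plausible reconstruction, assuming it follows the common practice of taking the condition number κ of the scaled fixed-effects design matrix (values below roughly 30 are usually treated as unproblematic):

```r
# Hypothetical reconstruction of kappa_mer() from custom_functions.R;
# the actual definition may differ in details (centering, intercept handling).
kappa_mer <- function(fit, center = FALSE) {
  # fixed-effects model matrix: merMod objects via getME(), others via model.matrix()
  X <- if (inherits(fit, "merMod")) lme4::getME(fit, "X") else model.matrix(fit)
  X <- X[, colnames(X) != "(Intercept)", drop = FALSE] # drop the intercept column
  kappa(scale(X, center = center, scale = TRUE), exact = TRUE)
}
```

Applied to any fitted lm, glm, or merMod object, this returns a single condition number for the fixed-effects predictors.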

Check for overdispersion.

performance::check_overdispersion(glmer_genitives1)
# Overdispersion test

       dispersion ratio =     0.950
  Pearson's Chi-Squared = 13080.509
                p-value =         1

Create Table 5.2 of regression model effects in the genitive alternation.

parameters::model_parameters(glmer_genitives1, effects = "fixed", verbose = FALSE) |>
  filter(p < 0.05) |>
  as.data.frame() |>
  select(c(1:3, 7, 9)) |>
  knitr::kable(digits = 3) |>
  kableExtra::kable_styling()
Table 3.2: The genitive alternation: fixed effect coefficients in mixed-effects logistic regression analysis. Predicted odds are for the s-genitive. Percent correctly predicted: 83.7 percent (baseline: 70.5 percent). C = 0.90. κ = 22.9. Only significant coefficients are shown.
Parameter                                                  Coefficient     SE        z      p
(Intercept)                                                     -0.994  0.161   -6.173  0.000
PorAnimacyBinanimate                                             2.057  0.160   12.879  0.000
PorLength                                                       -0.667  0.027  -24.911  0.000
PumLength                                                        0.582  0.023   25.718  0.000
PorNPexprTypeBincommon_noun                                     -0.954  0.057  -16.599  0.000
PorFinalSibilancyfinal_sibilant_present                         -0.951  0.198   -4.803  0.000
variety_of_ECanE                                                 0.452  0.137    3.313  0.001
variety_of_EHKE                                                  0.744  0.128    5.828  0.000
variety_of_ENZE                                                  0.506  0.129    3.926  0.000
variety_of_EPhlE                                                 0.327  0.131    2.483  0.013
variety_of_ESgE                                                  0.567  0.139    4.064  0.000
PorAnimacyBinanimate:variety_of_EHKE                            -0.714  0.233   -3.058  0.002
PorAnimacyBinanimate:variety_of_EPhlE                           -1.056  0.210   -5.037  0.000
PorFinalSibilancyfinal_sibilant_present:variety_of_ECanE        -0.739  0.312   -2.370  0.018
PorFinalSibilancyfinal_sibilant_present:variety_of_EHKE         -1.354  0.339   -3.989  0.000
PorFinalSibilancyfinal_sibilant_present:variety_of_ESgE         -0.750  0.309   -2.424  0.015

Now we plot the partial effects for the PorAnimacyBin:variety_of_E (Figure 5.3) and PorFinalSibilancy:variety_of_E (Figure 5.4) interactions:

Figure 5.3

Code
gen_var_anim_partial_eff <- effects::Effect(c("variety_of_E", "PorAnimacyBin"), glmer_genitives1) |>
  as.data.frame() |>
  mutate(
    variety_of_E = factor(variety_of_E, levels = var_levs),
    PorAnimacyBin = relevel(PorAnimacyBin, ref = "animate")
  )

gen_var_anim_partial_eff |>
  ggplot(aes(variety_of_E, fit, shape = PorAnimacyBin)) +
  geom_vline(aes(xintercept = variety_of_E), color = "gray", linetype = "dashed") +
  geom_errorbar(aes(ymin = lower, ymax = upper),
    width = .2, position = position_dodge(width = .4)
  ) +
  annotate("rect",
    xmin = c(0.5), xmax = c(1.5), ymin = -Inf, ymax = Inf,
    alpha = 0.2, fill = c("gray")
  ) +
  geom_point(size = 4, position = position_dodge(width = .4), fill = "white") +
  labs(y = "Predicted probability of s-genitive", x = "Variety", title = "") +
  scale_shape_manual(name = "PorAnimacyBin", values = c(21, 16)) +
  theme_cup() +
  theme( # plot.title = element_text(size=20, face="bold", vjust=2),
    plot.title = element_blank(),
    axis.title.x = element_text(size = 14, vjust = -0.5),
    axis.text.x = element_text(size = 14, angle = 90, hjust = 1, vjust = 0.5),
    axis.title.y = element_text(size = 14, vjust = 1.5),
    legend.position = "top"
  ) +
  scale_y_continuous(label = scales::percent)

Figure 3.3: Partial effects plot of the interaction of Variety_of_E with PorAnimacyBin in the genitive regression model. Predicted probabilities (vertical axis, expressed in %) are for the s-genitive. Vertical distance between dots is proportional to effect size. The plotted probabilities obtain when all other categorical constraints are set at their default levels, or at 0 in the case of numerical constraints. BrE (shaded) serves as the reference variety in the underlying regression model. Varieties to the left are Inner Circle varieties, varieties to the right are Outer Circle varieties.

Figure 5.4

Code
gen_var_sib_partial_eff <- effects::Effect(c("variety_of_E", "PorFinalSibilancy"), glmer_genitives1) |>
  as.data.frame() |>
  mutate(
    variety_of_E = factor(variety_of_E, levels = var_levs)
  )
gen_var_sib_partial_eff |>
  ggplot(aes(variety_of_E, fit, shape = PorFinalSibilancy)) +
  geom_vline(aes(xintercept = variety_of_E), color = "gray", linetype = "dashed") +
  geom_errorbar(aes(ymin = lower, ymax = upper),
    width = .2, position = position_dodge(width = .4)
  ) +
  annotate("rect",
    xmin = c(0.5), xmax = c(1.5), ymin = -Inf, ymax = Inf,
    alpha = 0.2, fill = c("gray")
  ) +
  geom_point(size = 4, position = position_dodge(width = .4), fill = "white") +
  labs(y = "Predicted probability of s-genitive", x = "Variety", title = "") +
  scale_shape_manual(name = "PorFinalSibilancy", values = c(16, 21)) +
  theme_cup() +
  theme( # plot.title = element_text(size=20, face="bold", vjust=2),
    plot.title = element_blank(),
    axis.title.x = element_text(size = 14, vjust = -0.5),
    axis.text.x = element_text(size = 14, angle = 90, hjust = 1, vjust = 0.5),
    axis.title.y = element_text(size = 14, vjust = 1.5),
    legend.position = "top"
  ) +
  scale_y_continuous(label = scales::percent)

Figure 3.4: Partial effects plot of the interaction of Variety_of_E with PorFinalSibilancy in the genitive regression model. Predicted probabilities (vertical axis, expressed in %) are for the s-genitive. Vertical distance between dots is proportional to effect size. The plotted probabilities obtain when all other categorical constraints are set at their default levels, or at 0 in the case of numerical constraints. BrE (shaded) serves as the reference variety in the underlying regression model. Varieties to the left are Inner Circle varieties, varieties to the right are Outer Circle varieties.

5.3 Dative alternation

Cross-tabulate response and variety of English.

table(data_datives$variety_of_E, data_datives$Response) |>
  make_prop_table() |>
  kable() |>
  kable_styling()
Table 3.3: Variant rates of dative constructions across varieties of English. Upper half of the table: Inner Circle varieties of English; lower half of the table: Outer Circle varieties of English.
          do     %     pd     %  Total
BrE      932  70.7    386  29.3   1318
CanE     968  71.3    389  28.7   1357
IrE      924  72.8    346  27.2   1270
NZE     1056  71.1    429  28.9   1485
HKE     1129  63.9    639  36.1   1768
IndE     912  58.8    638  41.2   1550
JamE     977  71.1    398  28.9   1375
PhlE     995  66.1    510  33.9   1505
SgE     1094  70.9    449  29.1   1543
Total   8987    NA   4184    NA  13171

Variable importance in global random forest model

Now fit a random forest to the full dative dataset. Again, save the output for reloading.

# formula
dat_f1 <- Response ~
  logWeightRatio +
  RecPron +
  ThemeBinComplexity +
  ThemeHeadFreq +
  ThemePron +
  ThemeDefiniteness +
  RecGivenness +
  RecHeadFreq
dat_f2 <- update(dat_f1, . ~ . + Variety + Circle + GenreCoarse)
dat_f3 <- update(dat_f1, . ~ . + GenreCoarse)

Run the forest.

if (!file.exists(here("model_output", "dat_forest1.rds"))) {
  dat_forest1 <- party::cforest(dat_f2, data_datives)
  saveRDS(dat_forest1, here("model_output", "dat_forest1.rds"))
} else {
  dat_forest1 <- here("model_output", "dat_forest1.rds") |>
    readRDS()
}
invisible(gc(verbose = FALSE)) # garbage collect to save RAM

Get predictions.

if (!file.exists(here("model_output", "dat_forest1_preds.rds"))) {
  dat_forest1_preds <- unlist(party::treeresponse(dat_forest1))[c(FALSE, TRUE)]
  saveRDS(dat_forest1_preds, here("model_output", "dat_forest1_preds.rds"))
} else {
  dat_forest1_preds <- here("model_output", "dat_forest1_preds.rds") |>
    readRDS()
}
invisible(gc(verbose = FALSE))

Check predictive performance.

data_datives$fitted_crf1 <- dat_forest1_preds
data_datives$predicted_crf1 <- as.factor(ifelse(data_datives$fitted_crf1 >= .5, "pd", "do"))
caret::confusionMatrix(data_datives$predicted_crf1, data_datives$Response,
  mode = "prec_recall"
)
Confusion Matrix and Statistics

          Reference
Prediction   do   pd
        do 6132 2951
        pd 2855 1233
                                          
               Accuracy : 0.5592          
                 95% CI : (0.5507, 0.5677)
    No Information Rate : 0.6823          
    P-Value [Acc > NIR] : 1.0000          
                                          
                  Kappa : -0.0231         
                                          
 Mcnemar's Test P-Value : 0.2125          
                                          
              Precision : 0.6751          
                 Recall : 0.6823          
                     F1 : 0.6787          
             Prevalence : 0.6823          
         Detection Rate : 0.4656          
   Detection Prevalence : 0.6896          
      Balanced Accuracy : 0.4885          
                                          
       'Positive' Class : do              
                                          

Area under ROC curve.

data.frame(
  obs = data_datives$Response,
  pred = data_datives$predicted_crf1,
  do = 1 - data_datives$fitted_crf1,
  pd = data_datives$fitted_crf1
) |>
  caret::twoClassSummary(lev = levels(data_datives$Response))
      ROC      Sens      Spec 
0.5020562 0.6823189 0.2946941 

Calculate (unconditional) variable importance.

if (!file.exists(here("model_output", "dat_forest1_varimp.rds"))) {
  dat_forest1_varimp <- permimp(dat_forest1, conditional = FALSE, progressBar = FALSE, asParty = TRUE)
  saveRDS(dat_forest1_varimp, here("model_output", "dat_forest1_varimp.rds"))
} else {
  dat_forest1_varimp <- here("model_output", "dat_forest1_varimp.rds") |>
    readRDS()
}

Create Figure 5.5, plotting the variable importances for the pooled dative dataset.

dat_forest1_varimp$values |>
  as.data.frame() |>
  rename(Varimp = "dat_forest1_varimp$values") |>
  rownames_to_column("Variable") |>
  ggplot(aes(reorder(Variable, Varimp), Varimp)) +
  geom_hline(yintercept = 0) +
  geom_segment(aes(xend = Variable, yend = 0), col = "black") +
  geom_point(col = "black") +
  coord_flip() +
  theme_cup() +
  labs(x = "", y = "Variable Importance")

Figure 3.5: CRF permutation variable importance ranking for the pooled dative dataset. C = 0.97.

Variable importance in by-variety random forest models

Next we fit CRFs to each variety individually, and calculate variable importance per variety.

data_datives_split <- split(data_datives, data_datives$variety_of_E)

Fit random forest for each variety.

if (file.exists(here("model_output", "datives_variety_forests.rds"))) {
  datives_variety_forests <- readRDS(
    here("model_output", "datives_variety_forests.rds")
  )
} else {
  datives_variety_forests <- lapply(
    data_datives_split, function(x) fit_party_crf(dat_f3, x)
  )
  saveRDS(datives_variety_forests, here("model_output", "datives_variety_forests.rds"))
}

Get the variable importances.

if (file.exists(here("model_output", "datives_variety_varimps.rds"))) {
  datives_variety_varimps <- readRDS(
    here("model_output", "datives_variety_varimps.rds")
  )
} else {
  datives_variety_varimps <- lapply(
    datives_variety_forests, function(x) get_party_varimp(x)$values
  )
  saveRDS(datives_variety_varimps, here("model_output", "datives_variety_varimps.rds"))
}

Create Figure 5.6

preds <- datives_variety_varimps[[1]] |>
  sort() |>
  names()

lapply(
  seq_along(datives_variety_varimps),
  function(i) {
    x <- datives_variety_varimps[[i]]
    x |>
      as.data.frame() |>
      rename(Varimp = 1) |>
      rownames_to_column("Variable") |>
      mutate(Variable = factor(Variable, levels = preds)) |>
      # bind_rows() |>
      # mutate(
      #   Variety = rep(names(genitives_variety_varimps), each = 9)
      # ) |>
      ggplot(aes(Variable, Varimp)) +
      geom_hline(yintercept = 0) +
      geom_segment(aes(xend = Variable, yend = 0), col = "black") +
      geom_point(col = "black") +
      coord_flip() +
      # facet_wrap(~Variety, ncol = 3) +
      theme_cup() +
      labs(x = "", y = "", title = names(datives_variety_varimps)[i])
  }
) |>
  wrap_plots(ncol = 3) +
  plot_annotation(caption = "Click to enlarge")

Figure 5.6: CRF permutation variable importance ranking of constraints on the dative alternation by variety of English.

Variety interactions in mixed-effects regression model

Next we fit a generalized linear mixed-effects regression model to the data.

Set reference levels.

data_datives$Response <- relevel(data_datives$Response, ref = "do")
data_datives$ThemeBinComplexity <- relevel(data_datives$ThemeBinComplexity, ref = "simple") # unlike in Melanie's PhD
data_datives$RecPron <- relevel(data_datives$RecPron, ref = "non-pron") # unlike in Melanie's PhD
data_datives$ThemePron <- relevel(data_datives$ThemePron, ref = "non-pron") # as with RecPron

Formula.

f_reg_dats <- Response ~
  logWeightRatio +
  RecPron +
  ThemeDefiniteness +
  ThemeHeadFreq +
  ThemePron +
  variety_of_E +

  logWeightRatio:variety_of_E + # According to Melanie's PhD, interacts with Variety
  RecPron:variety_of_E + # According to Melanie's PhD, interacts with Variety

  (1 | FileID_pruned) +
  (1 | Verb_pruned)

Fit mixed-effects logistic regression model.

glmer_ctrl <- glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e7))

if (!file.exists(here("model_output", "dat_glmer1.rds"))) {
  glmer_datives1 <- glmer(f_reg_dats, data_datives, family = binomial, control = glmer_ctrl)
  saveRDS(glmer_datives1, here("model_output", "dat_glmer1.rds"))
} else {
  glmer_datives1 <- readRDS(here("model_output", "dat_glmer1.rds"))
}

Performance measures of model.

model_performance(glmer_datives1)
Random effect variances not available. Returned R2 does not account for random effects.
# Indices of model performance

AIC      |     AICc |      BIC | R2 (cond.) | R2 (marg.) |  RMSE | Sigma | Log_loss | Score_log | Score_spherical
-----------------------------------------------------------------------------------------------------------------
7164.061 | 7164.221 | 7403.605 |            |      0.683 | 0.284 | 1.000 |    0.263 |      -Inf |       7.085e-04

Precision and recall:

data_datives$fitted_glmer1 <- fitted(glmer_datives1)
data_datives$predicted_glmer1 <- as.factor(ifelse(data_datives$fitted_glmer1 >= .5, "pd", "do"))
caret::confusionMatrix(data_datives$predicted_glmer1, data_datives$Response,
  mode = "prec_recall"
)
Confusion Matrix and Statistics

          Reference
Prediction   do   pd
        do 8348  840
        pd  639 3344
                                          
               Accuracy : 0.8877          
                 95% CI : (0.8822, 0.8931)
    No Information Rate : 0.6823          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7376          
                                          
 Mcnemar's Test P-Value : 1.987e-07       
                                          
              Precision : 0.9086          
                 Recall : 0.9289          
                     F1 : 0.9186          
             Prevalence : 0.6823          
         Detection Rate : 0.6338          
   Detection Prevalence : 0.6976          
      Balanced Accuracy : 0.8641          
                                          
       'Positive' Class : do              
                                          

Area under ROC curve:

data.frame(
  obs = data_datives$Response,
  pred = data_datives$predicted_glmer1,
  do = 1 - data_datives$fitted_glmer1,
  pd = data_datives$fitted_glmer1
) |>
  caret::twoClassSummary(lev = levels(data_datives$Response))
      ROC      Sens      Spec 
0.9501111 0.9288973 0.7992352 

Check for multicollinearity.

kappa_mer(glmer_datives1)
[1] 41.94774
check_collinearity(glmer_datives1)
# Check for Multicollinearity

Low Correlation

              Term  VIF       VIF 95% CI Increased SE Tolerance
 ThemeDefiniteness 1.15 [  1.13,   1.17]         1.07      0.87
     ThemeHeadFreq 1.65 [  1.61,   1.69]         1.28      0.61
         ThemePron 1.45 [  1.42,   1.48]         1.20      0.69
 Tolerance 95% CI
     [0.85, 0.89]
     [0.59, 0.62]
     [0.67, 0.70]

High Correlation

                        Term    VIF       VIF 95% CI Increased SE Tolerance
              logWeightRatio  11.17 [ 10.81,  11.54]         3.34      0.09
                     RecPron  10.75 [ 10.41,  11.11]         3.28      0.09
                variety_of_E  39.09 [ 37.80,  40.43]         6.25      0.03
 logWeightRatio:variety_of_E 222.30 [214.87, 229.99]        14.91  4.50e-03
        RecPron:variety_of_E 811.50 [784.34, 839.61]        28.49  1.23e-03
 Tolerance 95% CI
     [0.09, 0.09]
     [0.09, 0.10]
     [0.02, 0.03]
     [0.00, 0.00]
     [0.00, 0.00]
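For reference, the κ reported by the supplement's `kappa_mer()` helper is a condition number of the fixed-effects design matrix. A minimal base-R sketch of the idea on hypothetical predictors (the helper itself may differ in details):

```r
# Condition number of a scaled design matrix: the ratio of its largest
# to smallest singular value. Values far above ~30 suggest problematic
# collinearity. Drop the intercept column before calling.
kappa_design <- function(X) {
  X <- scale(X) # center and scale each predictor
  kappa(X, exact = TRUE)
}

set.seed(1)
a <- rnorm(200)
X_indep <- cbind(a = a, b = rnorm(200))                # independent
X_coll  <- cbind(a = a, b = a + rnorm(200, sd = 0.01)) # nearly collinear
kappa_design(X_indep) # close to 1
kappa_design(X_coll)  # very large
```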

Check for overdispersion.

performance::check_overdispersion(glmer_datives1)
# Overdispersion test

       dispersion ratio =     0.914
  Pearson's Chi-Squared = 12010.225
                p-value =         1

Create Table 5.4 of regression model effects in the dative alternation.

model_parameters(glmer_datives1, effects = "fixed", verbose = FALSE) |>
  filter(p < 0.05) |>
  as.data.frame() |>
  select(c(1:3, 7, 9)) |>
  kable(digits = 3) |>
  kable_styling()
Table 5.4: The dative alternation: fixed effect coefficients in mixed-effects logistic regression analysis. Predicted odds are for the prepositional dative construction. Percent correctly predicted: 88.8 percent (baseline: 68.2 percent). C = 0.95. κ = 42.1. Only significant coefficients are shown.
Parameter Coefficient SE z p
(Intercept) 1.255 0.365 3.442 0.001
logWeightRatio 1.491 0.127 11.717 0.000
RecPronpron -1.751 0.231 -7.570 0.000
ThemeDefinitenessdef 0.656 0.073 8.987 0.000
ThemeHeadFreq 0.000 0.000 5.211 0.000
ThemePronpron 1.227 0.145 8.478 0.000
variety_of_EHKE 0.557 0.161 3.473 0.001
variety_of_EIndE 1.222 0.167 7.311 0.000
variety_of_EPhlE 0.497 0.167 2.970 0.003
logWeightRatio:variety_of_EIndE -0.338 0.169 -1.993 0.046
RecPronpron:variety_of_EIndE -0.934 0.307 -3.039 0.002

Create Figure 5.7

dat_var_anim_partial_eff <- ggeffect(glmer_datives1, c("variety_of_E", "RecPron")) |>
  as.data.frame() |>
  mutate(
    variety_of_E = factor(x, levels = var_levs)
  )

dat_var_anim_partial_eff |>
  ggplot(aes(variety_of_E, predicted, shape = group)) +
  geom_vline(aes(xintercept = variety_of_E), color = "gray", linetype = "dashed") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
    width = .2, position = position_dodge(width = .4)
  ) +
  annotate("rect",
    xmin = c(0.5), xmax = c(1.5), ymin = -Inf, ymax = Inf,
    alpha = 0.2, fill = c("gray")
  ) +
  geom_point(size = 4, position = position_dodge(width = .4), fill = "white") +
  labs(y = "Predicted probability of prepositional dative", x = "Variety", title = "") +
  scale_shape_manual(name = "RecPron", values = c(16, 21)) +
  theme_cup() +
  theme( # plot.title = element_text(size=20, face="bold", vjust=2),
    plot.title = element_blank(),
    axis.title.x = element_text(size = 14, vjust = -0.5),
    axis.text.x = element_text(size = 14, angle = 90, hjust = 1, vjust = 0.5),
    axis.title.y = element_text(size = 14, vjust = 1.5),
    legend.position = "top"
  ) +
  scale_y_continuous(label = scales::percent)

Figure 5.7: Partial effects plot of the interaction of Variety_of_E with RecPron in the dative regression model. Predicted probabilities (vertical axis, expressed in %) are for the prepositional dative. Vertical distance between dots is proportional to effect size. The plotted probabilities obtain when all other categorical constraints are set at their default levels, or at 0 in the case of numerical constraints. BrE (shaded) serves as the reference variety in the underlying regression model. Varieties to the left are Inner Circle varieties, varieties to the right are Outer Circle varieties.

5.4 Particle placement alternation

Cross-tabulate the response and variety of English.

table(data_particle_verbs$variety_of_E, data_particle_verbs$Response) |>
  make_prop_table() |>
  kable() |>
  kable_styling()
Table 5.5: Variant rates of particle placement constructions across varieties of English. Upper half of the table: Inner Circle varieties of English; lower half of the table: Outer Circle varieties of English.
Continuous % Split % Total
BrE 939 70.9 385 29.1 1324
CanE 964 69.3 428 30.7 1392
IrE 989 73.4 358 26.6 1347
NZE 1059 68.6 484 31.4 1543
HKE 1028 85.1 180 14.9 1208
IndE 977 90.2 106 9.8 1083
JamE 944 85.4 162 14.6 1106
PhlE 868 86.5 136 13.5 1004
SgE 1130 84.8 203 15.2 1333
Total 8898 NA 2442 NA 11340

Variable importance in global random forest model

Now fit a conditional random forest to the dataset.

# formula
pv_f1 <- Response ~
  DirObjLettLength +
  DirObjDefiniteness +
  DirObjGivenness +
  DirObjConcreteness +
  DirObjThematicity +
  DirectionalPP +
  Semantics +
  Surprisal.P

pv_f2 <- update(pv_f1, . ~ . + variety_of_E + Circle + GenreCoarse)
pv_f3 <- update(pv_f1, . ~ . + GenreCoarse)
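As a quick illustration of what `update()` does to a formula here (toy variable names, not the particle placement predictors):

```r
# update.formula() keeps the existing terms (the dots on each side of
# the ~) and appends the new ones.
f <- y ~ a + b
update(f, . ~ . + c) # yields y ~ a + b + c
```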

Run the forest.

if (!file.exists(here("model_output", "pv_forest1.rds"))) {
  pv_forest1 <- party::cforest(pv_f2, data_particle_verbs)
  saveRDS(pv_forest1, here("model_output", "pv_forest1.rds"))
} else {
  pv_forest1 <- here("model_output", "pv_forest1.rds") |>
    readRDS()
}
invisible(gc(verbose = FALSE)) # garbage collect to save RAM

Get predictions.

if (!file.exists(here("model_output", "pv_forest1_preds.rds"))) {
  pv_forest1_preds <- unlist(party::treeresponse(pv_forest1))[c(FALSE, TRUE)]
  saveRDS(pv_forest1_preds, here("model_output", "pv_forest1_preds.rds"))
} else {
  pv_forest1_preds <- here("model_output", "pv_forest1_preds.rds") |>
    readRDS()
}
invisible(gc(verbose = FALSE))

Check predictive performance.

data_particle_verbs$fitted_crf1 <- pv_forest1_preds
data_particle_verbs$predicted_crf1 <- as.factor(ifelse(data_particle_verbs$fitted_crf1 >= .5, "Split", "Continuous"))
caret::confusionMatrix(data_particle_verbs$predicted_crf1, data_particle_verbs$Response,
  mode = "prec_recall"
)
Confusion Matrix and Statistics

            Reference
Prediction   Continuous Split
  Continuous       8618  1121
  Split             280  1321
                                          
               Accuracy : 0.8765          
                 95% CI : (0.8703, 0.8825)
    No Information Rate : 0.7847          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5822          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
              Precision : 0.8849          
                 Recall : 0.9685          
                     F1 : 0.9248          
             Prevalence : 0.7847          
         Detection Rate : 0.7600          
   Detection Prevalence : 0.8588          
      Balanced Accuracy : 0.7547          
                                          
       'Positive' Class : Continuous      
                                          

Area under ROC curve.

data.frame(
  obs = data_particle_verbs$Response,
  pred = data_particle_verbs$predicted_crf1,
  Continuous = 1 - data_particle_verbs$fitted_crf1,
  Split = data_particle_verbs$fitted_crf1
) |>
  caret::twoClassSummary(lev = levels(data_particle_verbs$Response))
      ROC      Sens      Spec 
0.9389622 0.9685323 0.5409500 

Variable importance.

if (!file.exists(here("model_output", "pv_forest1_varimp.rds"))) {
  pv_forest1_varimp <- permimp(pv_forest1, conditional = FALSE, progressBar = FALSE, asParty = TRUE)
  saveRDS(pv_forest1_varimp, here("model_output", "pv_forest1_varimp.rds"))
} else {
  pv_forest1_varimp <- here("model_output", "pv_forest1_varimp.rds") |>
    readRDS()
}

Create Figure 5.8

pv_forest1_varimp$values |>
  as.data.frame() |>
  rename(Varimp = "pv_forest1_varimp$values") |>
  rownames_to_column("Variable") |>
  ggplot(aes(reorder(Variable, Varimp), Varimp)) +
  geom_hline(yintercept = 0) +
  geom_segment(aes(xend = Variable, yend = 0), col = "black") +
  geom_point(col = "black") +
  coord_flip() +
  theme_cup() +
  labs(x = "", y = "Variable Importance")

Figure 5.8: Conditional Random Forest permutation variable importance ranking for the pooled particle placement dataset. C = 0.94.

Variable importance in by-variety random forest models

Next we fit CRFs to each variety individually.

data_particle_verbs_split <- split(data_particle_verbs, data_particle_verbs$variety_of_E)

Now fit random forest for each variety.

if (file.exists(here("model_output", "pv_variety_forests.rds"))) {
  pv_variety_forests <- readRDS(
    here("model_output", "pv_variety_forests.rds")
  )
} else {
  pv_variety_forests <- lapply(
    data_particle_verbs_split, function(x) fit_party_crf(pv_f3, x)
  )
  saveRDS(pv_variety_forests, here("model_output", "pv_variety_forests.rds"))
}

Now get the variable importances.

if (file.exists(here("model_output", "pv_variety_varimps.rds"))) {
  pv_variety_varimps <- readRDS(
    here("model_output", "pv_variety_varimps.rds")
  )
} else {
  pv_variety_varimps <- lapply(
    pv_variety_forests, function(x) get_party_varimp(x)$values
  )
  saveRDS(pv_variety_varimps, here("model_output", "pv_variety_varimps.rds"))
}

Create Figure 5.9

preds <- pv_variety_varimps[[1]] |>
  sort() |>
  names()

lapply(
  seq_along(pv_variety_varimps),
  function(i) {
    x <- pv_variety_varimps[[i]]
    x |>
      as.data.frame() |>
      rename(Varimp = 1) |>
      rownames_to_column("Variable") |>
      mutate(Variable = factor(Variable, levels = preds)) |>
      # bind_rows() |>
      # mutate(
      #   Variety = rep(names(genitives_variety_varimps), each = 9)
      # ) |>
      ggplot(aes(Variable, Varimp)) +
      geom_hline(yintercept = 0) +
      geom_segment(aes(xend = Variable, yend = 0), col = "black") +
      geom_point(col = "black") +
      coord_flip() +
      # facet_wrap(~Variety, ncol = 3) +
      theme_cup() +
      labs(x = "", y = "", title = names(pv_variety_varimps)[i])
  }
) |>
  wrap_plots(ncol = 3) +
  plot_annotation(caption = "Click to enlarge")

Figure 5.9: Conditional Random Forest permutation variable importance ranking of constraints on the particle placement alternation by variety of English.

Variety interactions in mixed-effects regression model

Next we fit a generalized linear mixed-effects regression model to the data.

data_particle_verbs$Response <- relevel(data_particle_verbs$Response, ref = "Continuous") # predicted odds are for the split pattern
data_particle_verbs$Semantics <- relevel(data_particle_verbs$Semantics, ref = "non-compositional")
data_particle_verbs$DirectionalPP <- relevel(data_particle_verbs$DirectionalPP, ref = "no")
data_particle_verbs$Circle <- relevel(data_particle_verbs$Circle, ref = "Inner")

Formula.

f_reg_pv <- Response ~
  Surprisal.P +
  DirObjLettLength +
  Semantics +
  DirectionalPP +
  DirObjThematicity +
  Circle +

  DirectionalPP:Circle +
  Semantics:Circle +

  (1 | variety_of_E) +
  (1 | FileID_pruned) +
  (1 | VerbPart_pruned)

Fit mixed-effects logistic regression model:

glmer_ctrl <- glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e7))

if (!file.exists(here("model_output", "pv_glmer1.rds"))) {
  glmer_particle_verbs1 <- glmer(f_reg_pv, data_particle_verbs, family = binomial, control = glmer_ctrl)
  saveRDS(glmer_particle_verbs1, here("model_output", "pv_glmer1.rds"))
} else {
  glmer_particle_verbs1 <- readRDS(here("model_output", "pv_glmer1.rds"))
}

Performance measures of model.

model_performance(glmer_particle_verbs1)
# Indices of model performance

AIC      |     AICc |      BIC | R2 (cond.) | R2 (marg.) |   ICC |  RMSE | Sigma | Log_loss | Score_log | Score_spherical
-------------------------------------------------------------------------------------------------------------------------
8055.950 | 8055.977 | 8143.983 |      0.598 |      0.302 | 0.424 | 0.329 | 1.000 |    0.340 |      -Inf |       9.466e-04

Precision and recall.

data_particle_verbs$fitted_glmer1 <- fitted(glmer_particle_verbs1)
data_particle_verbs$predicted_glmer1 <- as.factor(ifelse(data_particle_verbs$fitted_glmer1 >= .5, "Split", "Continuous"))
caret::confusionMatrix(data_particle_verbs$predicted_glmer1, data_particle_verbs$Response,
  mode = "prec_recall"
)
Confusion Matrix and Statistics

            Reference
Prediction   Continuous Split
  Continuous       8431  1275
  Split             467  1167
                                         
               Accuracy : 0.8464         
                 95% CI : (0.8396, 0.853)
    No Information Rate : 0.7847         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.4834         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
              Precision : 0.8686         
                 Recall : 0.9475         
                     F1 : 0.9064         
             Prevalence : 0.7847         
         Detection Rate : 0.7435         
   Detection Prevalence : 0.8559         
      Balanced Accuracy : 0.7127         
                                         
       'Positive' Class : Continuous     
                                         

Area under ROC curve.

data.frame(
  obs = data_particle_verbs$Response,
  pred = data_particle_verbs$predicted_glmer1,
  Continuous = 1 - data_particle_verbs$fitted_glmer1,
  Split = data_particle_verbs$fitted_glmer1
) |>
  caret::twoClassSummary(lev = levels(data_particle_verbs$Response))
      ROC      Sens      Spec 
0.8783687 0.9475163 0.4778870 

Check multicollinearity

kappa_mer(glmer_particle_verbs1)
[1] 7.148104
performance::check_collinearity(glmer_particle_verbs1)
# Check for Multicollinearity

Low Correlation

                 Term  VIF   VIF 95% CI Increased SE Tolerance Tolerance 95% CI
          Surprisal.P 1.01 [1.00, 1.18]         1.00      0.99     [0.85, 1.00]
     DirObjLettLength 1.02 [1.01, 1.05]         1.01      0.98     [0.95, 0.99]
            Semantics 1.59 [1.55, 1.63]         1.26      0.63     [0.61, 0.64]
        DirectionalPP 1.82 [1.77, 1.87]         1.35      0.55     [0.54, 0.57]
    DirObjThematicity 1.02 [1.00, 1.05]         1.01      0.98     [0.95, 1.00]
               Circle 1.21 [1.18, 1.24]         1.10      0.83     [0.81, 0.84]
 DirectionalPP:Circle 1.85 [1.81, 1.90]         1.36      0.54     [0.52, 0.55]
     Semantics:Circle 1.75 [1.71, 1.80]         1.32      0.57     [0.56, 0.58]

Check overdispersion.

performance::check_overdispersion(glmer_particle_verbs1)
# Overdispersion test

       dispersion ratio =    0.821
  Pearson's Chi-Squared = 9300.361
                p-value =        1

Create Table 5.6 of regression model effects in the particle placement alternation.

model_parameters(glmer_particle_verbs1, effects = "fixed", verbose = FALSE) |>
  filter(Parameter == "(Intercept)" | p < 0.05) |>
  as.data.frame() |>
  select(c(1:3, 7, 9)) |>
  kable(digits = 3) |>
  kable_styling()
Table 5.6: The particle placement alternation: fixed effect coefficients in mixed-effects logistic regression analysis. Predicted odds are for the split variant. Percent correctly predicted: 84.5 percent (baseline: 78.5 percent). C = 0.88. κ = 7.14. Only the intercept and significant coefficients are shown.
Parameter Coefficient SE z p
(Intercept) -0.944 0.604 -1.564 0.118
Surprisal.P 0.257 0.021 12.548 0.000
DirObjLettLength -0.142 0.006 -22.578 0.000
Semanticscompositional 0.930 0.080 11.602 0.000
DirectionalPPyes 1.493 0.122 12.236 0.000
DirObjThematicity 0.034 0.005 7.037 0.000
CircleOuter -0.801 0.132 -6.085 0.000
DirectionalPPyes:CircleOuter -0.439 0.174 -2.520 0.012
Semanticscompositional:CircleOuter -0.319 0.119 -2.681 0.007

Create Figure 5.10, the barplot of by-variety intercept adjustments.

nms <- rownames(ranef(glmer_particle_verbs1)$variety_of_E)
intercepts <- ranef(glmer_particle_verbs1)$variety_of_E[, 1]
barplot(intercepts[order(intercepts)], names.arg = nms[order(intercepts)], horiz = FALSE, las = 2, cex.names = 0.8, family = "serif")

Figure 5.10: Intercept adjustments for the random effect VARIETY_OF_E. Predicted odds are for the split variant.

Create Figure 5.11, the partial effects plot for DirectionalPP:Circle interaction.

pv_var_dirPP_partial_eff <- effects::Effect(c("Circle", "DirectionalPP"), glmer_particle_verbs1) |>
  as.data.frame()

pv_var_dirPP_partial_eff |>
  ggplot(aes(Circle, fit, shape = DirectionalPP)) +
  geom_errorbar(aes(ymin = lower, ymax = upper),
    width = .2, position = position_dodge(width = .4)
  ) +
  annotate("rect",
    xmin = c(0.5), xmax = c(1.5), ymin = -Inf, ymax = Inf,
    alpha = 0.2, fill = c("gray")
  ) +
  geom_point(size = 4, position = position_dodge(width = .4), fill = "white") +
  labs(y = "Predicted probability of split variant", x = "Circle", title = "") +
  # scale_shape_manual(name = "PorAnimacyBin", values = c(21, 16)) +
  theme_cup() +
  theme( # plot.title = element_text(size=20, face="bold", vjust=2),
    plot.title = element_blank(),
    axis.title.x = element_text(size = 14, vjust = -0.5),
    axis.text.x = element_text(size = 14, angle = 90, hjust = 1, vjust = 0.5),
    axis.title.y = element_text(size = 14, vjust = 1.5),
    legend.position = "top"
  ) +
  scale_y_continuous(limits = c(0, .9), breaks = seq(0, .9, .1), label = scales::percent)

Figure 5.11: Partial effects plot of the interaction of CIRCLE with DIRECTIONALPP in the particle placement model. Predicted probabilities (vertical axis, expressed in percent) are for the split variant. Vertical distance between dots is proportional to effect size. The plotted probabilities obtain when all other categorical constraints are set at their default levels, or at 0 in the case of numerical constraints. The Inner Circle (shaded) serves as the reference category in the underlying regression model.

Create Figure 5.12, the partial effects plot for the Semantics:Circle interaction.

pv_var_semantics_partial_eff <- effects::Effect(c("Circle", "Semantics"), glmer_particle_verbs1) |>
  as.data.frame() |>
  mutate(
    Semantics = factor(Semantics, levels = c("compositional", "non-compositional"))
  )

pv_var_semantics_partial_eff |>
  ggplot(aes(Circle, fit, shape = Semantics)) +
  geom_errorbar(aes(ymin = lower, ymax = upper),
    width = .2, position = position_dodge(width = .4)
  ) +
  annotate("rect",
    xmin = c(0.5), xmax = c(1.5), ymin = -Inf, ymax = Inf,
    alpha = 0.2, fill = c("gray")
  ) +
  geom_point(size = 4, position = position_dodge(width = .4), fill = "white") +
  labs(y = "Predicted probability of split variant", x = "Circle", title = "") +
  # scale_shape_manual(name = "PorAnimacyBin", values = c(21, 16)) +
  theme_cup() +
  theme( # plot.title = element_text(size=20, face="bold", vjust=2),
    plot.title = element_blank(),
    axis.title.x = element_text(size = 14, vjust = -0.5),
    axis.text.x = element_text(size = 14, angle = 90, hjust = 1, vjust = 0.5),
    axis.title.y = element_text(size = 14, vjust = 1.5),
    legend.position = "top"
  ) +
  scale_y_continuous(limits = c(0, .9), breaks = seq(0, .9, .1), label = scales::percent)

Figure 5.12: Partial effects plot of the interaction of CIRCLE with SEMANTICS in the particle placement model. Predicted probabilities (vertical axis, expressed in percent) are for the split variant. Vertical distance between dots is proportional to effect size. The plotted probabilities obtain when all other categorical constraints are set at their default levels, or at 0 in the case of numerical constraints. The Inner Circle (shaded) serves as the reference category in the underlying regression model.

Interaction of binned length in mixed-effects regression model

Fit model with binned length.

data_particle_verbs <- data_particle_verbs |>
  mutate(
    DirObjLettLength_bin = ifelse(DirObjLettLength <= 11, "short", "long") |>
      as.factor()
  )
data_particle_verbs$DirObjLettLength_bin <- relevel(data_particle_verbs$DirObjLettLength_bin, ref = "short")
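The same binning could be done with `cut()` (an alternative sketch with a toy vector; the `ifelse()` above is what the analysis actually uses):

```r
# cut() with an explicit break at 11 reproduces the short/long split:
# values <= 11 fall in (-Inf, 11] ("short"), larger values in (11, Inf].
len <- c(3, 11, 12, 25)
cut(len, breaks = c(-Inf, 11, Inf), labels = c("short", "long"))
# short short long long
```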

New model.

f_reg_pv2 <- Response ~
  Surprisal.P +
  DirObjLettLength_bin +
  Semantics +
  DirectionalPP +
  DirObjThematicity +
  Circle +

  DirObjLettLength_bin:Circle +
  DirectionalPP:Circle +
  Semantics:Circle +

  (1 | variety_of_E) +
  (1 | FileID_pruned) +
  (1 | VerbPart_pruned)

if (!file.exists(here("model_output", "pv_glmer2.rds"))) {
  glmer_particle_verbs2 <- glmer(f_reg_pv2, data_particle_verbs, family = binomial, control = glmer_ctrl)
  saveRDS(glmer_particle_verbs2, here("model_output", "pv_glmer2.rds"))
} else {
  glmer_particle_verbs2 <- readRDS(here("model_output", "pv_glmer2.rds"))
}

Performance measures of model.

model_performance(glmer_particle_verbs2)
# Indices of model performance

AIC      |     AICc |      BIC | R2 (cond.) | R2 (marg.) |   ICC |  RMSE | Sigma | Log_loss | Score_log | Score_spherical
-------------------------------------------------------------------------------------------------------------------------
8143.988 | 8144.020 | 8239.357 |      0.571 |      0.242 | 0.433 | 0.330 | 1.000 |    0.344 |      -Inf |       6.577e-04

Precision and recall.

data_particle_verbs$fitted_glmer2 <- fitted(glmer_particle_verbs2)
data_particle_verbs$predicted_glmer2 <- as.factor(ifelse(data_particle_verbs$fitted_glmer2 >= .5, "Split", "Continuous"))
caret::confusionMatrix(data_particle_verbs$predicted_glmer2, data_particle_verbs$Response,
  mode = "prec_recall"
)
Confusion Matrix and Statistics

            Reference
Prediction   Continuous Split
  Continuous       8431  1310
  Split             467  1132
                                          
               Accuracy : 0.8433          
                 95% CI : (0.8365, 0.8499)
    No Information Rate : 0.7847          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.4699          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
              Precision : 0.8655          
                 Recall : 0.9475          
                     F1 : 0.9047          
             Prevalence : 0.7847          
         Detection Rate : 0.7435          
   Detection Prevalence : 0.8590          
      Balanced Accuracy : 0.7055          
                                          
       'Positive' Class : Continuous      
                                          

Area under ROC curve.

data.frame(
  obs = data_particle_verbs$Response,
  pred = data_particle_verbs$predicted_glmer2,
  Continuous = 1 - data_particle_verbs$fitted_glmer2,
  Split = data_particle_verbs$fitted_glmer2
) |>
  caret::twoClassSummary(lev = levels(data_particle_verbs$Response))
      ROC      Sens      Spec 
0.8750134 0.9475163 0.4635545 

Check multicollinearity

kappa_mer(glmer_particle_verbs2)
[1] 8.440186
performance::check_collinearity(glmer_particle_verbs2)
# Check for Multicollinearity

Low Correlation

                        Term  VIF   VIF 95% CI Increased SE Tolerance
                 Surprisal.P 1.01 [1.00, 1.20]         1.00      0.99
        DirObjLettLength_bin 1.61 [1.57, 1.65]         1.27      0.62
                   Semantics 1.57 [1.54, 1.62]         1.25      0.64
               DirectionalPP 1.79 [1.75, 1.84]         1.34      0.56
           DirObjThematicity 1.01 [1.00, 1.06]         1.01      0.99
                      Circle 1.28 [1.25, 1.31]         1.13      0.78
 DirObjLettLength_bin:Circle 1.67 [1.62, 1.71]         1.29      0.60
        DirectionalPP:Circle 1.83 [1.78, 1.88]         1.35      0.55
            Semantics:Circle 1.73 [1.68, 1.77]         1.31      0.58
 Tolerance 95% CI
     [0.84, 1.00]
     [0.60, 0.64]
     [0.62, 0.65]
     [0.54, 0.57]
     [0.94, 1.00]
     [0.76, 0.80]
     [0.58, 0.62]
     [0.53, 0.56]
     [0.56, 0.59]

Check overdispersion.

performance::check_overdispersion(glmer_particle_verbs2)
# Overdispersion test

       dispersion ratio =    0.848
  Pearson's Chi-Squared = 9599.862
                p-value =        1

Create Figure 5.13, the partial effects plot for the Circle:DirObjLettLength_bin interaction.

pv_var_length_partial_eff <- effects::Effect(c("Circle", "DirObjLettLength_bin"), glmer_particle_verbs2) |>
  as.data.frame() |>
  mutate(
    DirObjLettLength_bin = factor(DirObjLettLength_bin, levels = c("long", "short"))
  )

pv_var_length_partial_eff |>
  ggplot(aes(Circle, fit, shape = DirObjLettLength_bin)) +
  geom_errorbar(aes(ymin = lower, ymax = upper),
    width = .2, position = position_dodge(width = .4)
  ) +
  annotate("rect",
    xmin = c(0.5), xmax = c(1.5), ymin = -Inf, ymax = Inf,
    alpha = 0.2, fill = c("gray")
  ) +
  geom_point(size = 4, position = position_dodge(width = .4), fill = "white") +
  labs(y = "Predicted probability of split variant", x = "Circle", title = "") +
  # scale_shape_manual(name = "PorAnimacyBin", values = c(21, 16)) +
  theme_cup() +
  theme( # plot.title = element_text(size=20, face="bold", vjust=2),
    plot.title = element_blank(),
    axis.title.x = element_text(size = 14, vjust = -0.5),
    axis.text.x = element_text(size = 14, angle = 90, hjust = 1, vjust = 0.5),
    axis.title.y = element_text(size = 14, vjust = 1.5),
    legend.position = "top"
  ) +
  scale_y_continuous(limits = c(0, .9), breaks = seq(0, .9, .1), label = scales::percent)

Figure 5.13: Partial effects plot of the interaction of Circle with DirObjLettLength_bin in the particle placement model. Predicted probabilities (vertical axis, expressed in percent) are for the split variant. Vertical distance between dots is proportional to effect size. The plotted probabilities obtain when all other categorical constraints are set at their default levels, or at 0 in the case of numerical constraints. The Inner Circle (shaded) serves as the reference category in the underlying regression model.

Remove large model objects that are no longer needed, to clean up the workspace (optional).

rm(
  gen_forest1, genitives_variety_forests, glmer_genitives1,
  dat_forest1, datives_variety_forests, glmer_datives1,
  pv_forest1, pv_variety_forests, glmer_particle_verbs1, glmer_particle_verbs2
)
invisible(gc(verbose = FALSE)) # free unused memory

Chapter 6

In this chapter we run the Variation-Based Distance & Similarity Modeling (VADIS) analyses of the three alternations. Note that because the line 3 variable importance scores are based on random forest models, results will vary slightly from run to run. We recommend tuning the models before running them, and setting random seeds for exact reproducibility (not done here).
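Setting a seed would look like this (a sketch; the published analyses do not fix the seed):

```r
# Fixing the RNG state immediately before each stochastic fit makes
# random-forest results exactly reproducible across runs.
set.seed(2023) # any fixed integer works
# pv_forest1 <- party::cforest(pv_f2, data_particle_verbs)
runif(2) # same draws on every run after the same seed
```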

6.4 Quantification of Similarity Coefficients

The code here shows only the analysis for the entire datasets. See Section 6.4 of the Appendix for the code for the subsets of the data (spoken/written only, Inner/Outer Circle only).

All available data

Start with the GENITIVE alternation. First run the by-variety regression models.

# Add random effects
gen_f4 <- update(gen_f1, . ~ . + (1 | PorHeadLemma_pruned) + (1 | GenreCoarse))

if (!file.exists(here("model_output", "gen_glmer_list.rds"))) {
  genitives_variety_glmer <- vector("list")
  for (i in seq_along(data_genitives_split)) {
    d <- data_genitives_split[[i]]
    # standardize the model inputs, excluding the response and random effects
    d_std <- VADIS::stand(d, cols = gen_f4) # use the model formula to select columns
    # fit the model
    genitives_variety_glmer[[i]] <- glmer(gen_f4,
      data = d_std, family = binomial, # set optimizer controls to help convergence
      control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e7))
    )
    rm(d, d_std) # remove datasets
  }
  names(genitives_variety_glmer) <- names(data_genitives_split)
  saveRDS(genitives_variety_glmer, here("model_output", "gen_glmer_list.rds"))
} else {
  genitives_variety_glmer <- readRDS(here("model_output", "gen_glmer_list.rds"))
}

Check models.

round(VADIS::summary_stats(genitives_variety_glmer), 3)
        N baseline predicted.corr Brier     C LogScore      AIC Max.VIF kappa
BrE  1612    0.729          0.854 0.102 0.910    0.328 1163.020   1.205 1.958
CanE 1360    0.616          0.854 0.108 0.920    0.346 1018.358   1.228 2.243
IrE  1313    0.703          0.843 0.111 0.903    0.350 1001.838   1.149 1.978
NZE  1867    0.694          0.849 0.107 0.905    0.347 1418.148   1.109 1.940
HKE  1529    0.680          0.853 0.106 0.917    0.336 1135.965   1.123 2.128
IndE 1836    0.752          0.874 0.091 0.921    0.297 1212.166   1.133 1.923
JamE 1307    0.772          0.876 0.088 0.919    0.286  847.436   1.212 1.981
PhlE 1749    0.719          0.851 0.110 0.899    0.350 1323.141   1.092 1.912
SgE  1225    0.656          0.849 0.112 0.910    0.361  991.788   1.225 2.365
     HosLem.p
BrE     0.014
CanE    0.550
IrE     0.527
NZE     0.031
HKE     0.767
IndE    0.565
JamE    0.564
PhlE    0.594
SgE     0.004

Now fit a conditional random forest for each variety.

if (file.exists(here("model_output", "gen_crf_vadis_list.rds"))) {
  genitives_variety_forests_vadis <- readRDS(
    here("model_output", "gen_crf_vadis_list.rds")
  )
} else {
  genitives_variety_forests_vadis <- lapply(
    data_genitives_split, function(x) fit_party_crf(gen_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
  )
  saveRDS(genitives_variety_forests_vadis, here("model_output", "gen_crf_vadis_list.rds"))
}

Run the VADIS analyses for lines 1, 2 and 3.

gen_signif_line <- vadis_line1(genitives_variety_glmer,
  path = here("model_output", "gens_line1.rds"),
  overwrite = "reload"
)
gen_coef_line <- vadis_line2(genitives_variety_glmer,
  path = here("model_output", "gens_line2.rds"),
  overwrite = "reload"
)
gen_varimp_line <- vadis_line3(genitives_variety_forests_vadis,
  path = here("model_output", "gens_line3.rds"),
  conditional = FALSE, overwrite = "reload"
)

Now look at the mean values by line and variety.

gen_mean_sims <- data.frame(
  line1 = gen_signif_line$similarity.scores[, 1], # get only the values in the first column
  line2 = gen_coef_line$similarity.scores[, 1],
  line3 = gen_varimp_line$similarity.scores[, 1]
) %>%
  add_row(!!!colMeans(.))
gen_mean_sims$Variety_Mean <- rowMeans(gen_mean_sims)
rownames(gen_mean_sims) <- c(names(data_genitives_split), "Line Mean")
round(gen_mean_sims, 3)
          line1 line2 line3 Variety_Mean
BrE       0.931 0.675 0.845        0.817
CanE      0.889 0.713 0.890        0.831
IrE       0.931 0.752 0.848        0.844
NZE       0.889 0.727 0.890        0.835
HKE       0.861 0.679 0.705        0.749
IndE      0.861 0.619 0.780        0.753
JamE      0.931 0.703 0.786        0.806
PhlE      0.889 0.623 0.810        0.774
SgE       0.931 0.727 0.845        0.834
Line Mean 0.901 0.691 0.822        0.805

Next the DATIVE alternation.

# Add random effects
dat_f4 <- update(dat_f1, . ~ . + (1 | Verb_pruned) + (1 | GenreCoarse))

if (!file.exists(here("model_output", "dat_glmer_list.rds"))) {
  datives_variety_glmer <- vector("list")
  for (i in seq_along(data_datives_split)) {
    d <- data_datives_split[[i]]
    # standardize the model inputs, excluding the response and random effects
    d_std <- VADIS::stand(d, cols = dat_f4) # use the model formula to select columns
    # fit the model
    datives_variety_glmer[[i]] <- glmer(dat_f4,
      data = d_std, family = binomial, # set optimizer controls to help convergence
      control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e7))
    )
    rm(d, d_std) # remove datasets
  }
  names(datives_variety_glmer) <- names(data_datives_split)
  saveRDS(datives_variety_glmer, here("model_output", "dat_glmer_list.rds"))
} else {
  datives_variety_glmer <- readRDS(here("model_output", "dat_glmer_list.rds"))
}

Check models.

round(VADIS::summary_stats(datives_variety_glmer), 3)
        N baseline predicted.corr Brier     C LogScore      AIC Max.VIF kappa
BrE  1318    0.707          0.891 0.077 0.949    0.257  775.557   2.538 4.009
CanE 1357    0.713          0.900 0.072 0.960    0.230  725.754   2.089 3.759
IrE  1270    0.728          0.896 0.074 0.953    0.244  724.136   2.081 4.025
NZE  1485    0.711          0.894 0.080 0.946    0.263  893.623   2.228 3.747
HKE  1768    0.639          0.890 0.081 0.952    0.265 1064.279   2.003 3.929
IndE 1550    0.588          0.898 0.076 0.956    0.259  932.243   2.017 4.241
JamE 1375    0.711          0.925 0.059 0.969    0.202  665.191   2.412 4.107
PhlE 1505    0.661          0.909 0.069 0.963    0.229  794.234   2.154 4.134
SgE  1543    0.709          0.888 0.079 0.951    0.256  888.064   2.447 4.001
     HosLem.p
BrE     0.482
CanE    0.828
IrE     0.308
NZE     0.082
HKE     0.239
IndE    0.002
JamE    0.327
PhlE    0.927
SgE     0.374
if (file.exists(here("model_output", "dat_crf_vadis_list.rds"))) {
  datives_variety_forests_vadis <- readRDS(
    here("model_output", "dat_crf_vadis_list.rds")
  )
} else {
  datives_variety_forests_vadis <- lapply(
    data_datives_split, function(x) fit_party_crf(dat_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
  )
  saveRDS(datives_variety_forests_vadis, here("model_output", "dat_crf_vadis_list.rds"))
}

Run the VADIS analyses for lines 1, 2 and 3.

dat_signif_line <- vadis_line1(datives_variety_glmer,
  path = here("model_output", "dat_line1.rds"),
  overwrite = "reload"
)
dat_coef_line <- vadis_line2(datives_variety_glmer,
  path = here("model_output", "dat_line2.rds"),
  overwrite = "reload"
)
dat_varimp_line <- vadis_line3(datives_variety_forests_vadis,
  path = here("model_output", "dat_line3.rds"),
  conditional = FALSE, overwrite = "reload"
)

Now look at the mean values by line and variety.

dat_mean_sims <- data.frame(
  line1 = dat_signif_line$similarity.scores[, 1], # get only the values in the first column
  line2 = dat_coef_line$similarity.scores[, 1],
  line3 = dat_varimp_line$similarity.scores[, 1]
) %>%
  add_row(!!!colMeans(.))
dat_mean_sims$Variety_Mean <- rowMeans(dat_mean_sims)
rownames(dat_mean_sims) <- c(names(data_datives_split), "Line Mean")
round(dat_mean_sims, 3)
          line1 line2 line3 Variety_Mean
BrE       0.703 0.738 0.732        0.724
CanE      0.625 0.750 0.685        0.687
IrE       0.656 0.711 0.688        0.685
NZE       0.781 0.774 0.723        0.760
HKE       0.641 0.744 0.771        0.719
IndE      0.719 0.720 0.720        0.720
JamE      0.688 0.553 0.842        0.694
PhlE      0.656 0.712 0.732        0.700
SgE       0.781 0.735 0.798        0.771
Line Mean 0.694 0.715 0.743        0.718

Finally, the PARTICLE PLACEMENT alternation.

# Add random effects
# f2 <- update(f1, .~. + (1|FileID) + (1|VerbPart_pruned) + (1|DirObjHeadPlain_pruned) + (1|GenreCoarse))
pv_f4 <- update(pv_f1, . ~ . + (1 | VerbPart_pruned) + (1 | Genre))

if (!file.exists(here("model_output", "pv_glmer_list.rds"))) {
  pv_variety_glmer <- vector("list")
  for (i in seq_along(data_particle_verbs_split)) {
    d <- data_particle_verbs_split[[i]]
    # standardize the model inputs, excluding the response and random effects
    d_std <- VADIS::stand(d, cols = pv_f4) # use the model formula to select columns
    # fit the model
    pv_variety_glmer[[i]] <- glmer(pv_f4,
      data = d_std, family = binomial, # set optimizer controls to help convergence
      control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e7))
    )
    rm(d, d_std) # remove datasets
  }
  names(pv_variety_glmer) <- names(data_particle_verbs_split)
  saveRDS(pv_variety_glmer, here("model_output", "pv_glmer_list.rds"))
} else {
  pv_variety_glmer <- readRDS(here("model_output", "pv_glmer_list.rds"))
}

Check models.

round(VADIS::summary_stats(pv_variety_glmer), 3)
        N baseline predicted.corr Brier     C LogScore      AIC Max.VIF kappa
BrE  1324    0.709          0.832 0.117 0.893    0.365 1118.141   1.164 1.699
CanE 1392    0.693          0.830 0.114 0.905    0.354 1161.421   1.148 1.821
IrE  1347    0.734          0.834 0.114 0.891    0.358 1125.746   1.167 1.714
NZE  1543    0.686          0.854 0.108 0.912    0.346 1267.649   1.133 1.666
HKE  1208    0.851          0.882 0.082 0.898    0.265  761.069   1.180 1.767
IndE 1083    0.902          0.922 0.064 0.877    0.219  543.611   1.260 1.928
JamE 1106    0.854          0.892 0.083 0.880    0.276  709.759   1.253 1.726
PhlE 1004    0.865          0.888 0.083 0.885    0.265  602.609   1.215 1.686
SgE  1333    0.848          0.893 0.081 0.908    0.260  828.958   1.201 1.689
     HosLem.p
BrE     0.406
CanE    0.051
IrE     0.021
NZE     0.196
HKE     0.125
IndE    0.549
JamE    0.181
PhlE    0.855
SgE     0.042
if (file.exists(here("model_output", "pv_crf_vadis_list.rds"))) {
  pv_variety_forests_vadis <- readRDS(
    here("model_output", "pv_crf_vadis_list.rds")
  )
} else {
  pv_variety_forests_vadis <- lapply(
    data_particle_verbs_split, function(x) fit_party_crf(pv_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
  )
  saveRDS(pv_variety_forests_vadis, here("model_output", "pv_crf_vadis_list.rds"))
}

Run the VADIS analyses for lines 1, 2 and 3.

pv_signif_line <- vadis_line1(pv_variety_glmer,
  path = here("model_output", "pv_line1.rds"),
  overwrite = "reload"
)
pv_coef_line <- vadis_line2(pv_variety_glmer,
  path = here("model_output", "pv_line2.rds"),
  overwrite = "reload"
)
pv_varimp_line <- vadis_line3(pv_variety_forests_vadis,
  path = here("model_output", "pv_line3.rds"),
  conditional = FALSE,
  overwrite = "reload"
)

Now look at the mean values by line and variety.

pv_mean_sims <- data.frame(
  line1 = pv_signif_line$similarity.scores[, 1], # get only the values in the first column
  line2 = pv_coef_line$similarity.scores[, 1],
  line3 = pv_varimp_line$similarity.scores[, 1]
) %>%
  add_row(!!!colMeans(.))
pv_mean_sims$Variety_Mean <- rowMeans(pv_mean_sims)
rownames(pv_mean_sims) <- c(names(data_particle_verbs_split), "Line Mean")
round(pv_mean_sims, 3)
          line1 line2 line3 Variety_Mean
BrE       0.719 0.798 0.780        0.765
CanE      0.766 0.759 0.744        0.756
IrE       0.719 0.806 0.804        0.776
NZE       0.766 0.772 0.798        0.779
HKE       0.797 0.792 0.756        0.781
IndE      0.609 0.664 0.613        0.629
JamE      0.734 0.815 0.708        0.753
PhlE      0.812 0.740 0.631        0.728
SgE       0.766 0.819 0.774        0.786
Line Mean 0.743 0.774 0.734        0.750

Now create Table 6.5.

tab <- data.frame(
  Genitives = unlist(gen_mean_sims[10, ]),
  Datives = unlist(dat_mean_sims[10, ]),
  `Particle placement` = unlist(pv_mean_sims[10, ])
)
rownames(tab) <- c("1st line", "2nd line", "3rd line", "mean")
kable(tab, digits = 2) |>
  kable_styling()
Table 4.1: Similarity coefficients across lines of evidence and alternations. Input dataset: all available data. Coefficients range between 0 (total dissimilarity) and 1 (total similarity).
Genitives Datives Particle.placement
1st line 0.90 0.69 0.74
2nd line 0.69 0.72 0.77
3rd line 0.82 0.74 0.73
mean 0.80 0.72 0.75

Core Grammar Score:

round(rowMeans(tab[4, ]), 3)
 mean 
0.758 

For simplicity, we’ve recreated Table 6.6 here. Again, all code for the analyses is in Section 6.1.

Table 4.2: Core grammar scores (Γ) and hierarchies of stability for subsets of the data.
Core grammar score (Γ) Hierarchy of stability
All data 0.76 genitives > particles > datives
Spoken only 0.62 particles > datives > genitives
Written only 0.75 genitives > datives > particles
Inner Circle only 0.79 genitives > particles > datives
Outer Circle only 0.73 genitives > datives > particles
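The hierarchies of stability above follow from ranking the per-alternation mean similarity coefficients. A minimal sketch for the "All data" row, using the mean values reported in Table 6.5:

```r
# Rank alternations by mean similarity coefficient (higher = more stable);
# values are the "mean" row of Table 6.5 (all available data)
means <- c(Genitives = 0.80, Datives = 0.72, Particles = 0.75)
paste(names(sort(means, decreasing = TRUE)), collapse = " > ")
# "Genitives > Particles > Datives"
```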

6.5 Evaluating Coherence

Figure 6.1 is shown here. Again, the numbers will vary slightly from run to run due to the inherent randomness of random forests.

gen_varimp_line$distance.matrix
            BrE       CanE        IrE        NZE        HKE       IndE
CanE 0.09523810                                                       
IrE  0.09523810 0.07142857                                            
NZE  0.09523810 0.00000000 0.07142857                                 
HKE  0.42857143 0.21428571 0.30952381 0.21428571                      
IndE 0.04761905 0.16666667 0.23809524 0.16666667 0.47619048           
JamE 0.23809524 0.11904762 0.04761905 0.11904762 0.23809524 0.38095238
PhlE 0.14285714 0.16666667 0.19047619 0.16666667 0.19047619 0.19047619
SgE  0.09523810 0.04761905 0.19047619 0.04761905 0.28571429 0.09523810
           JamE       PhlE
CanE                      
IrE                       
NZE                       
HKE                       
IndE                      
JamE                      
PhlE 0.28571429           
SgE  0.28571429 0.19047619

Next compute Distance-Based Coherence (DBC). First get the fused distance matrices, combining all lines of evidence, then calculate pairwise Mantel correlations between the three alternations.

# genitives across lines of evidence
dfused_genitives_all <- analogue::fuse(
  gen_signif_line$distance.matrix,
  gen_coef_line$distance.matrix,
  gen_varimp_line$distance.matrix
)

# datives across lines of evidence
dfused_datives_all <- analogue::fuse(
  dat_signif_line$distance.matrix,
  dat_coef_line$distance.matrix,
  dat_varimp_line$distance.matrix
)

# particles across lines of evidence
dfused_particles_all <- analogue::fuse(
  pv_signif_line$distance.matrix,
  pv_coef_line$distance.matrix,
  pv_varimp_line$distance.matrix
)

# fusing fused distance matrices across alternations
dfused_total_all <- analogue::fuse(
  dfused_genitives_all,
  dfused_datives_all,
  dfused_particles_all
)

gen_dat_cor <- vegan::mantel(dfused_genitives_all, dfused_datives_all)
gen_pv_cor <- vegan::mantel(dfused_genitives_all, dfused_particles_all)
dat_pv_cor <- vegan::mantel(dfused_datives_all, dfused_particles_all)

Create Table 6.7.

data.frame(
  Comparison = c("genitive / dative", "genitive / particle", "dative / particle"),
  r = round(c(gen_dat_cor$statistic, gen_pv_cor$statistic, dat_pv_cor$statistic), 2),
  pval = round(c(gen_dat_cor$signif, gen_pv_cor$signif, dat_pv_cor$signif), 2)
) |>
  kable() |>
  kable_styling()
Table 4.3: DBC_alternation measurements: Mantel correlation coefficients between fused distance matrices (combining all lines of evidence and based on all available data).
Comparison r pval
genitive / dative 0.18 0.24
genitive / particle 0.43 0.04
dative / particle 0.18 0.18

Next calculate by-line pairwise correlations for each alternation, as in Table 6.8.

gen_line12 <- vegan::mantel(gen_signif_line$distance.matrix, gen_coef_line$distance.matrix)
gen_line13 <- vegan::mantel(gen_signif_line$distance.matrix, gen_varimp_line$distance.matrix)
gen_line23 <- vegan::mantel(gen_coef_line$distance.matrix, gen_varimp_line$distance.matrix)

dat_line12 <- vegan::mantel(dat_signif_line$distance.matrix, dat_coef_line$distance.matrix)
dat_line13 <- vegan::mantel(dat_signif_line$distance.matrix, dat_varimp_line$distance.matrix)
dat_line23 <- vegan::mantel(dat_coef_line$distance.matrix, dat_varimp_line$distance.matrix)

pv_line12 <- vegan::mantel(pv_signif_line$distance.matrix, pv_coef_line$distance.matrix)
pv_line13 <- vegan::mantel(pv_signif_line$distance.matrix, pv_varimp_line$distance.matrix)
pv_line23 <- vegan::mantel(pv_coef_line$distance.matrix, pv_varimp_line$distance.matrix)

Create Table 6.8.

data.frame(
  Comparison = c("overlap 1st / 2nd lines", "overlap 1st / 3rd lines", "overlap 2nd / 3rd lines"),
  Genitive = c(
    paste0("r = ", round(gen_line12$statistic, 2), " (p = ", round(gen_line12$signif, 2), ")"),
    paste0("r = ", round(gen_line13$statistic, 2), " (p = ", round(gen_line13$signif, 2), ")"),
    paste0("r = ", round(gen_line23$statistic, 2), " (p = ", round(gen_line23$signif, 2), ")")
  ),
  Dative = c(
    paste0("r = ", round(dat_line12$statistic, 2), " (p = ", round(dat_line12$signif, 2), ")"),
    paste0("r = ", round(dat_line13$statistic, 2), " (p = ", round(dat_line13$signif, 2), ")"),
    paste0("r = ", round(dat_line23$statistic, 2), " (p = ", round(dat_line23$signif, 2), ")")
  ),
  Particle = c(
    paste0("r = ", round(pv_line12$statistic, 2), " (p = ", round(pv_line12$signif, 2), ")"),
    paste0("r = ", round(pv_line13$statistic, 2), " (p = ", round(pv_line13$signif, 2), ")"),
    paste0("r = ", round(pv_line23$statistic, 2), " (p = ", round(pv_line23$signif, 2), ")")
  )
) |>
  kable() |>
  kable_styling()
Table 4.4: Mantel correlation coefficients between line-of-evidence-specific distance matrices.
Comparison Genitive Dative Particle
overlap 1st / 2nd lines r = 0.41 (p = 0.01) r = 0.12 (p = 0.34) r = 0.36 (p = 0.06)
overlap 1st / 3rd lines r = 0.04 (p = 0.4) r = 0 (p = 0.48) r = 0.35 (p = 0.04)
overlap 2nd / 3rd lines r = 0.39 (p = 0.07) r = -0.41 (p = 0.96) r = 0.61 (p = 0)

6.6 Mapping Out (Dis)similarity Relationships

In this section we visualize the distances between varieties using Multidimensional Scaling (MDS) plots, hierarchical clustering dendrograms, and neighbornet diagrams (Bryant & Moulton 2004).

Create Figure 6.2 plotting distances based on the 3rd line of evidence for the particle placement alternation.

input <- pv_varimp_line$distance.matrix # specify the distance matrix
fit <- cmdscale(input, eig = TRUE, k = 2) # k is the number of dim
fit.df <- as.data.frame(fit[[1]])
names(fit.df) <- c("x", "y")
fit.df$Variety <- rownames(fit.df) |> as.factor()

ggplot(fit.df, aes(x, y)) +
  geom_text_repel(aes(label = Variety),
    size = 7,
    box.padding = unit(0.5, "lines"), segment.colour = "grey65"
  ) +
  geom_point(size = 5) + # change dot size here
  labs(y = "MDS Dimension 2", x = "MDS Dimension 1") +
  theme_cup() +
  theme(axis.title = element_text(size = 15))

Figure 4.1: Multidimensional scaling representation of third line distances for the particle placement alternation (see Figure 6.1). Distances between data points in the plot are proportional to probabilistic grammar distances between varieties.

Next get the MDS for each alternation.

gen_mds <- cmdscale(dfused_genitives_all, eig = TRUE, k = 2) |> # k is the number of dim
  first() |>
  as.data.frame() |>
  rename(x = "V1", y = "V2") |>
  rownames_to_column("Variety")

dat_mds <- cmdscale(dfused_datives_all, eig = TRUE, k = 2) |> # k is the number of dim
  first() |>
  as.data.frame() |>
  rename(x = "V1", y = "V2") |>
  rownames_to_column("Variety")

pv_mds <- cmdscale(dfused_particles_all, eig = TRUE, k = 2) |> # k is the number of dim
  first() |>
  as.data.frame() |>
  rename(x = "V1", y = "V2") |>
  rownames_to_column("Variety")
ggplot(gen_mds, aes(x, y)) +
  geom_text_repel(aes(label = Variety),
    size = 7,
    box.padding = unit(0.5, "lines"), segment.colour = "grey65"
  ) +
  geom_point(size = 5) + # change dot size here
  labs(y = "MDS Dimension 2", x = "MDS Dimension 1") +
  theme_cup() +
  theme(axis.title = element_text(size = 15))
ggplot(dat_mds, aes(x, y)) +
  geom_text_repel(aes(label = Variety),
    size = 7,
    box.padding = unit(0.5, "lines"), segment.colour = "grey65"
  ) +
  geom_point(size = 5) + # change dot size here
  labs(y = "MDS Dimension 2", x = "MDS Dimension 1") +
  theme_cup() +
  theme(axis.title = element_text(size = 15))
ggplot(pv_mds, aes(x, y)) +
  geom_text_repel(aes(label = Variety),
    size = 7,
    box.padding = unit(0.5, "lines"), segment.colour = "grey65"
  ) +
  geom_point(size = 5) + # change dot size here
  labs(y = "MDS Dimension 2", x = "MDS Dimension 1") +
  theme_cup() +
  theme(axis.title = element_text(size = 15))

(a)

(b)

(c)

Figure 4.2: Multidimensional scaling representation of compromise distances per alternation: a) genitive alternation; b) dative alternation; c) particle placement alternation. Distances between data points in the plots are proportional to probabilistic grammar distances between varieties. Boxes depict Inner Circle clusters.

Create Figure 6.4 showing the MDS of all lines and alternations.

cmdscale(dfused_total_all, eig = TRUE, k = 2) |> # k is the number of dim
  first() |>
  as.data.frame() |>
  rename(x = "V1", y = "V2") |>
  rownames_to_column("Variety") |>
  ggplot(aes(x, y)) +
  geom_text_repel(aes(label = Variety),
    size = 7,
    box.padding = unit(0.5, "lines"), segment.colour = "grey65"
  ) +
  geom_point(size = 5) + # change dot size here
  geom_rect(aes(xmin = -.18, xmax = .4, ymin = -.2, ymax = .1),
    fill = NA,
    color = "black"
  ) +
  labs(y = "MDS Dimension 2", x = "MDS Dimension 1") +
  theme_cup() +
  theme(axis.title = element_text(size = 15))

Figure 4.3: Multidimensional scaling representation of the Γ-matrix (a single compromise distance matrix merged across all lines and alternations). Distances between data points in the plot are proportional to probabilistic grammar distances between varieties. Box depicts Inner Circle cluster.

Create Figure 6.5, a hierarchical cluster model of all lines and alternations.

hclust(dfused_total_all, method = "ward.D2") |>
  plot()

Figure 4.4: Dendrogram: clustering the Γ-matrix (a single compromise distance matrix merged across all lines and alternations). Hierarchical agglomerative cluster algorithm: WARD.

Create Figure 6.6, a NeighborNet model of all lines and alternations. Note that plots will vary slightly from run to run.

nnet <- phangorn::neighborNet(dfused_total_all)
# plot(nnet, "3D") # movable 3D
par("mar" = rep(1, 4))
plot(nnet, "2D")

Figure 4.5: Visualizing aggregate similarities: NeighborNet diagram depicting the Γ-matrix (a single compromise distance matrix merged across all lines and alternations). Internode distances (branch lengths) are proportional to cophenetic linguistic distances.

6.7 Discussion

In the discussion we present results of a simulation study validating the VADIS method.

mod_list <- readRDS(here("model_output", "intercept_speaker_glm_list.rds"))
line1 <- VADIS::vadis_line1(mod_list, path = F)
line2 <- VADIS::vadis_line2(mod_list, path = F)
rf_mod_list <- readRDS(here("model_output", "intercept_speaker_rf_list.rds"))
line3 <- VADIS::vadis_line3(rf_mod_list, path = F)

Figure 6.7 shows an MDS of distances simulating “speakers” from 5 different “varieties”, based on the compromise distances from 3 lines of evidence.

Code
fused_dist <- analogue::fuse(
  line1$distance.matrix,
  line2$distance.matrix,
  line3$distance.matrix
)
fused_mds <- cmdscale(fused_dist, k = 3) %>%
  as.data.frame()
names(fused_mds) <- c("x", "y", "z")
fused_mds <- fused_mds %>%
  rownames_to_column("Name") |>
  mutate(
    # Name = rownames(rank_mds),
    Variety = str_replace(Name, "\\w$", "") %>% factor(),
    Speaker = str_extract(Name, "\\w$") %>% factor()
  )
levels(fused_mds$Variety) <- LETTERS[1:5]

fused_mds %>%
  ggplot(aes(x, y, group = Variety)) +
  geom_point(aes(shape = Variety)) +
  stat_ellipse(level = .9, linetype = "dashed", type = "norm", color = "grey") +
  stat_ellipse(level = .5, linewidth = 1, type = "norm", color = "grey") +
  labs(x = "MDS dimension 1", y = "MDS dimension 2") +
  theme_cup() +
  theme(
    legend.position = c(1, 0), legend.justification = c(1, 0),
    legend.background = element_rect(color = "black")
  )

Figure 4.6: Multidimensional scaling representation of the fused distances from all three VADIS lines for seventy-five simulated datasets representing five hypothetical dialect varieties. Points represent hypothetical “speakers” of five “dialects”. Distances between points in the plot are proportional to probabilistic grammar distances between datasets. Note that when simulating the data, varieties C and D were intentionally designed to be much more similar to one another than to the other three varieties.

Figure 6.8 shows a cluster analysis dendrogram based on distances from the second line of evidence.

dd.row <- hclust(line2$distance.matrix, "average") %>%
  as.dendrogram()
ddata_x <- ggdendro::dendro_data(dd.row)

labs <- label(ddata_x) %>%
  mutate(
    Variety = LETTERS[as.integer(str_extract(label, "\\d"))],
    label = str_replace(label, "var\\d", Variety)
  )

ggplot(segment(ddata_x)) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_text(
    data = labs,
    aes(label = label, x = x, y = y),
    hjust = 1
  ) +
  # scale_color_manual(values = cols) +
  coord_flip() +
  theme_dendro()

Figure 4.7: Hierarchical clustering of simulated datasets based on VADIS Line 2. Leaf nodes represent individual “speakers” (lowercase letters) across five distinct “dialects” (uppercase letters), for example, “Bk” represents speaker “k” in dialect “B”, and so on. Note that dialects are clearly separated from one another as we designed them to be.

Chapter 7

In this chapter we present the analysis of the ratings experiment.

Load the experimental ratings data.

ratings_df <- here("data", "data_exp_ratings.txt") |>
  read.delim() |>
  mutate(Variety = factor(Variety, levels = c("BrE", "NZE", "SgE", "IndE")))

7.1 Methods

Create Figure 7.1 showing the distribution of corpus model probabilities for each experimental item.

ratings_df |>
  select(item, pred) |>
  mutate(pred = round(pred, 6)) |>
  distinct() |>
  mutate(prob = gtools::inv.logit(pred)) |>
  ggplot(aes(reorder(item, prob), prob)) +
  geom_point() +
  ylim(0, 1) +
  labs(x = "Item", y = "Corpus model probability") +
  theme_cup() +
  theme(axis.text.x = element_text(size = 8))

Figure 5.1: Experimental items vs. corpus model predicted probabilities.

Instruction page in Figure 7.2 (click to embiggen).

knitr::include_graphics("07_exp_instructions_page.png")

Figure 5.2: Welcome page and instructions for the ratings experiments.

7.2 Results

Correlations between corpus model predictions and ratings for split variant by variety.

Code
nested_df <- ratings_df |>
  arrange(Variety) |>
  group_by(Variety) |>
  nest()

ylab <- data.frame(
  label = "Preference rating for split variant",
  x = 1,
  y = 1
) |>
  ggplot() +
  geom_text(aes(x, y, label = label), angle = 90, family = "serif") +
  theme_void() +
  coord_cartesian(clip = "off")

p <- map(
  seq_along(nested_df$data),
  .f = function(i) {
    d <- nested_df$data[[i]]
    p <- d |>
      ggplot(aes(pred, rating_raw)) +
      geom_smooth(
        method = "loess", formula = y ~ x, se = F, color = "black",
        linetype = 2
      ) +
      geom_point(alpha = .2, size = .5, color = "grey50") +
      labs(
        x = "",
        y = ""
      ) +
      ggtitle(pull(nested_df, Variety)[i]) +
      ylim(0, 100) +
      theme_cup()

    if ((i %% 2) == 0) {
      p <- p + theme(
        # axis.title.y = element_blank(),
        axis.text.y = element_blank()
      )
    }
    p
  }
) |>
  wrap_plots()

p <- ylab + p + plot_layout(widths = c(.8, 25)) +
  plot_annotation(
    caption = "Corpus model predictions for split variant (logit scale)",
    theme = theme(plot.caption = element_text(
      hjust = 0.5, size = 12,
      margin = margin(t = 0),
      family = "serif"
    ))
  )
p

Figure 5.3: Experimental ratings vs. corpus model predicted log odds, with LOESS smooths (BrE r = .54; NZE r = .48; IndE r = .47; SgE r = .42). Ratings > 50 indicate greater preference for the split variant (pick the book up), ratings < 50 indicate greater preference for the continuous variant (pick up the book). Dots show ratings for each individual trial.

Mean ratings for split variant by variety.

meds <- c(by(ratings_df$rating_raw, ratings_df$Variety, median))

ratings_df |>
  ggplot(aes(Variety, rating_raw)) +
  geom_hline(yintercept = 0, linewidth = 1, color = "grey") +
  geom_point(
    size = 1, alpha = .4, position = position_jitter(width = .2),
    color = "darkgray", pch = 16
  ) +
  ggplot2::stat_summary(fun = "median", size = 1.5, color = "black", pch = 18) +
  geom_label(
    data = data.frame(),
    aes(x = names(meds), y = meds + 8, label = meds), # offset labels above the medians
    fontface = "bold", size = 5, family = "serif"
  ) +
  labs(x = "", y = "Preference rating for split variant") +
  ggtitle("Variety") +
  theme_cup()

Figure 5.4: Experimental ratings across varieties, with median rating. Ratings > 50 indicate greater preference for the split variant (pick the book up), ratings < 50 indicate greater preference for the continuous variant (pick up the book).

Correlations between corpus model predictions and ratings for split variant by variety, averaged by stimulus item.

Code
nested_df <- ratings_df |>
  arrange(Variety) |>
  group_by(Variety) |>
  nest() |>
  mutate(
    data = map(
      data,
      ~ .x |>
        group_by(item) |>
        summarise(
          rating = mean(rating_raw),
          pred = median(pred)
        )
    )
  )

ylab <- data.frame(
  label = "Mean preference rating for split variant",
  x = 1,
  y = 1
) |>
  ggplot() +
  geom_text(aes(x, y, label = label), angle = 90, family = "serif") +
  theme_void() +
  coord_cartesian(clip = "off")

p <- map(
  seq_along(nested_df$data),
  .f = function(i) {
    d <- nested_df$data[[i]]
    p <- d |>
      ggplot(aes(pred, rating)) +
      geom_smooth(method = "lm", formula = y ~ x, se = F, color = "gray") +
      geom_point() +
      ggrepel::geom_text_repel(aes(label = item),
        size = 4,
        segment.color = "grey", family = "serif",
        max.overlaps = 15
      ) +
      labs(
        x = "",
        y = ""
      ) +
      ggtitle(pull(nested_df, Variety)[i]) +
      ylim(0, 100) +
      theme_cup()

    if ((i %% 2) == 0) p <- p + theme(axis.text.y = element_blank())
    return(p)
  }
) |>
  wrap_plots()

p <- ylab + p + plot_layout(widths = c(1, 25)) +
  plot_annotation(
    caption = "Corpus model predictions for split variant (logit scale)",
    theme = theme(plot.caption = element_text(
      hjust = 0.5, size = 12,
      margin = margin(t = 0),
      family = "serif"
    ))
  )
p

Figure 5.5: Experimental ratings vs. corpus model predictions, averaged by item and variety. Ratings > 50 indicate greater preference for the split variant (pick the book up), ratings < 50 indicate greater preference for the continuous variant (pick up the book). Dots show mean ratings for each item.

Create Table 7.1 showing mean ratings (with standard deviation) per item and variety.

Code
item_means <- ratings_df |>
  group_by(item, Variety) |>
  summarise(
    mean = round(mean(rating_raw), 1),
    SD = round(sd(rating_raw, na.rm = T), 1)
  ) |>
  ungroup()

item_mean_tab <- item_means |>
  dplyr::filter(Variety == "BrE") |>
  rename(Br.Mean = "mean", Br.SD = "SD") |>
  select(-Variety) |>
  bind_cols(
    item_means |>
      dplyr::filter(Variety == "NZE") |>
      rename(NZ.Mean = "mean", NZ.SD = "SD") |>
      select(3:4),
    item_means |>
      dplyr::filter(Variety == "SgE") |>
      rename(Sg.Mean = "mean", Sg.SD = "SD") |>
      select(3:4),
    item_means |>
      dplyr::filter(Variety == "IndE") |>
      rename(Ind.Mean = "mean", Ind.SD = "SD") |>
      select(3:4)
  ) |>
  rename(Item = "item") |>
  arrange(desc(Br.Mean))

knitr::kable(item_mean_tab)
Table 5.1: Average ratings (with standard deviations) by item across participants. Ratings above 50 reflect preference for the split variant; ratings below 50 reflect preference for the continuous variant.
Item Br.Mean Br.SD NZ.Mean NZ.SD Sg.Mean Sg.SD Ind.Mean Ind.SD
B10 94.7 14.4 88.7 23.9 81.2 24.0 88.4 23.8
B8 94.0 13.2 92.7 16.9 84.2 23.5 86.6 24.5
A6 87.9 18.9 90.0 22.1 77.6 29.9 87.6 22.4
A10 87.8 16.1 80.5 32.7 62.1 41.4 55.1 36.2
B4 82.5 21.5 89.8 15.7 60.7 37.9 76.4 33.3
A9 82.4 18.7 89.9 21.6 72.4 37.4 69.1 32.7
B9 77.3 15.5 80.0 24.7 71.9 34.0 79.6 25.8
A15 76.1 22.0 79.4 31.0 82.9 22.3 77.0 27.6
B7 68.1 23.9 81.2 30.6 61.6 39.6 55.5 40.3
A8 67.2 25.2 66.1 37.5 56.2 37.7 71.4 33.3
B3 66.5 27.2 47.9 37.4 43.2 35.6 35.4 34.6
B15 65.2 29.1 64.6 35.4 56.2 39.5 48.5 36.0
A5 58.5 35.1 67.4 39.2 59.2 42.3 69.6 28.8
A12 57.8 30.7 58.7 40.2 61.6 39.6 55.4 32.7
B14 56.9 35.1 64.5 39.0 53.2 39.9 51.7 39.4
A13 56.5 29.5 39.1 37.9 44.1 41.4 43.7 33.6
B6 55.3 30.8 43.5 35.5 64.3 33.6 44.2 32.8
A4 54.0 28.3 58.6 37.0 57.2 40.2 60.4 35.4
A14 53.2 29.5 62.3 38.3 46.8 39.5 46.4 37.5
B13 48.0 28.9 60.3 37.7 45.4 40.6 42.9 36.1
B5 46.2 29.2 54.5 39.5 61.2 36.0 47.4 32.2
B1 42.0 28.1 31.0 35.9 34.1 39.0 32.6 35.8
A7 41.6 34.9 45.8 43.2 38.2 33.6 26.9 26.2
B12 39.0 28.9 57.9 38.3 67.1 38.6 49.1 39.0
A1 32.5 27.4 33.9 38.2 33.2 34.2 35.6 36.0
A3 24.9 23.8 19.2 31.4 12.4 23.7 11.2 15.5
B11 21.4 21.3 18.1 29.7 21.5 29.4 10.2 16.3
A2 20.4 22.7 16.7 31.2 10.6 18.5 10.7 16.1
A11 15.1 21.3 12.3 22.7 12.3 21.8 13.3 17.1
B2 7.6 15.1 5.5 9.9 8.6 11.8 5.6 9.6

Statistical Analysis

All continuous predictors were centered and scaled by 2 standard deviations following Gelman (2008). We also center the ratings around the midpoint 50.

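The z.() scaling helper used below is defined earlier in the supplement. As an illustrative stand-in (not the authors' function), Gelman-style rescaling centers a variable and divides by a multiple of its standard deviation:

```r
# Illustrative stand-in for the supplement's z.() helper (hypothetical;
# the real helper is defined earlier in the source): center a numeric
# vector and divide by `factor` standard deviations (Gelman 2008)
z_gelman <- function(x, factor = 2) {
  (x - mean(x, na.rm = TRUE)) / (factor * sd(x, na.rm = TRUE))
}

round(z_gelman(c(2, 4, 6, 8)), 3)
# -> -0.581 -0.194  0.194  0.581
```

With factor = 2, a predictor one standard deviation above its mean maps to 0.5, which puts binary and continuous predictors on roughly comparable scales.
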
Code
ratings_df <- ratings_df |>
  mutate(
    z.DirObjLettLength = z.(DirObjLettLength, factor = 2),
    z.pred_min = z.(pred_new_noLen, factor = 2),
    rating_c = rating_raw - 50,
    gender = as.factor(gender),
    education_simple = factor(education_simple, levels = c("university", "no_university", "postgraduate"))
  )

We use a custom contrast coding for Variety, which encodes the following comparisons:

  • Level 1: Inner (BrE and NZE) vs. Outer (IndE and SgE) Circle
  • Level 2: BrE vs. NZE (Inner only)
  • Level 3: IndE vs. SgE (Outer only)

Use the {hypr} package to derive the appropriate contrast codings.

Code
custom_variety_hyp <- hypr::hypr(
  InnerCvsOuterC = BrE + NZE ~ IndE + SgE,
  BrEvsNZE = BrE ~ NZE,
  IndEvsSgE = IndE ~ SgE,
  levels = levels(ratings_df$Variety)
)

custom_variety_hyp
hypr object containing 3 null hypotheses:
H0.InnerCvsOuterC: 0 = BrE + NZE - IndE - SgE
      H0.BrEvsNZE: 0 = BrE - NZE
     H0.IndEvsSgE: 0 = IndE - SgE

Call:
hypr(InnerCvsOuterC = ~BrE + NZE - IndE - SgE, BrEvsNZE = ~BrE - 
    NZE, IndEvsSgE = ~IndE - SgE, levels = c("BrE", "NZE", "SgE", 
"IndE"))

Hypothesis matrix (transposed):
     InnerCvsOuterC BrEvsNZE IndEvsSgE
BrE   1              1        0       
NZE   1             -1        0       
SgE  -1              0       -1       
IndE -1              0        1       

Contrast matrix:
     InnerCvsOuterC BrEvsNZE IndEvsSgE
BrE   1/4            1/2        0     
NZE   1/4           -1/2        0     
SgE  -1/4              0     -1/2     
IndE -1/4              0      1/2     
Code
ratings_df$Variety_custom <- ratings_df$Variety
contrasts(ratings_df$Variety_custom) <- hypr::contr.hypothesis(custom_variety_hyp)
contrasts(ratings_df$Variety_custom)
     InnerCvsOuterC BrEvsNZE IndEvsSgE
BrE            0.25      0.5       0.0
NZE            0.25     -0.5       0.0
SgE           -0.25      0.0      -0.5
IndE          -0.25      0.0       0.5
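
As a sanity check (not part of the original pipeline), the same contrast matrix can be recovered without {hypr}: it is the generalized (Moore-Penrose) inverse of the hypothesis matrix, which MASS::ginv() computes directly:

```r
# Hypothesis matrix: one row per comparison, one column per variety
hmat <- rbind(
  InnerCvsOuterC = c(BrE = 1, NZE = 1, SgE = -1, IndE = -1),
  BrEvsNZE       = c(BrE = 1, NZE = -1, SgE = 0, IndE = 0),
  IndEvsSgE      = c(BrE = 0, NZE = 0, SgE = -1, IndE = 1)
)

# The contrast matrix is the generalized inverse of the hypothesis matrix
cmat <- MASS::ginv(hmat)
dimnames(cmat) <- rev(dimnames(hmat))
round(cmat, 2) # matches contr.hypothesis(custom_variety_hyp)
```

Because the three hypothesis rows are orthogonal, the generalized inverse reproduces exactly the 1/4 and 1/2 weights shown above.
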

We fit a linear mixed model (estimated using REML and the nloptwrap optimizer) to predict centered rating [rating_c] with variety [Variety_custom], direct object length [z.DirObjLettLength], corpus model prediction [z.pred_min], gender, and education [education_simple]. The model included second-order interaction terms for variety x direct object length and variety x corpus prediction (formula: rating_c ~ Variety_custom * (z.DirObjLettLength + z.pred_min) + gender + education_simple). It also included random intercepts for participant [id] and stimulus item, along with by-participant random slopes for corpus model prediction (z.pred_min) and direct object length (z.DirObjLettLength) (formula: list(~1 | id, ~1 | item, ~0 + z.pred_min | id, ~0 + z.DirObjLettLength | id)).

The final model formula:

ratings_fmla0 <- rating_c ~ (1 | id) +
  (1 | item) +
  (0 + z.pred_min | id) +
  (0 + z.DirObjLettLength | id) +
  Variety_custom +
  z.DirObjLettLength +
  z.pred_min +
  Variety_custom:z.DirObjLettLength +
  Variety_custom:z.pred_min +
  gender +
  education_simple

We fit a model with the demographic predictors, then one without.

rating_mod0 <- lmer(ratings_fmla0, data = ratings_df)
# remove demo factors
ratings_fmla1 <- update(ratings_fmla0, . ~ . - gender - education_simple)
rating_mod1 <- lmer(ratings_fmla1, data = ratings_df)

Compare the models. There is no evidence for an effect of gender or education, but we report the model with the demographic factors. (Note that anova() refits both models with ML, so the AIC values it reports differ from the REML-based values given by model_performance().)

anova(rating_mod0, rating_mod1)
Data: ratings_df
Models:
rating_mod1: ratings_fmla1
rating_mod0: ratings_fmla0
            npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)
rating_mod1   17 38071 38177 -19018    38037                     
rating_mod0   20 38076 38201 -19018    38036 0.7638  3     0.8581
model_performance(rating_mod0)
# Indices of model performance

AIC       |      AICc |       BIC | R2 (cond.) | R2 (marg.) |   ICC |   RMSE |  Sigma
-------------------------------------------------------------------------------------
38018.629 | 38018.846 | 38144.004 |      0.377 |      0.238 | 0.183 | 29.725 | 30.562

Create Table 7.2

Code
y <- lme4::getME(rating_mod0, "y")
smry <- summary(rating_mod0)
logLik <- round(smry$logLik, 2)
aic <- round(AIC(rating_mod0), 2)
r_sq_m <- round(MuMIn::r.squaredGLMM(rating_mod0)[1], 3)
r_sq_c <- round(MuMIn::r.squaredGLMM(rating_mod0)[2], 3)
n <- nrow(getME(rating_mod0, "X"))

fixed_eff_tab <- smry$coefficients |>
  as.data.frame() |>
  dplyr::select(-3)
rownames(fixed_eff_tab) <- gsub("Variety_custom", "", rownames(fixed_eff_tab))


names(fixed_eff_tab)[4] <- "p-value"

random_eff_tab <- smry$varcor |>
  as.data.frame() |>
  mutate(grp = str_remove(grp, "\\..*")) |>
  rename("Stand.Dev." = "sdcor") |>
  mutate(
    name = c(
      "id: by-DirObjWordLength",
      "id: by-CorpusPred",
      "id: (Intercept)",
      "item: (Intercept)",
      "Residual"
    )
  ) |>
  select(6, 5) |>
  mutate(`Stand.Dev.` = round(Stand.Dev., 2))

data.frame(
  `Marginal R2` = r_sq_m,
  `Conditional R2` = r_sq_c,
  N = n,
  LogLikelihood = logLik,
  AIC = aic
) |>
  knitr::kable()
fixed_eff_tab |>
  knitr::kable(
    # format = "html",
    digits = 3,
    label = "fixef"
  )
random_eff_tab |>
  knitr::kable(
    format = "html",
    digits = 2,
    label = "ranef"
  )

Table 5.2: Summary statistics, fixed effects coefficients, and random effects standard deviations for linear mixed-model fitting participant rating as function of DIRECT OBJECT LENGTH, CORPUS MODEL PREDICTION, and participant VARIETY, GENDER, and EDUCATION LEVEL.

(a) Model summary
Marginal.R2 Conditional.R2 N LogLikelihood AIC
0.23 0.397 3900 -18989.31 38018.63
(b) Fixed effects
Estimate Std. Error t value p-value
(Intercept) 3.125 2.787 1.121 0.270
InnerCvsOuterC 11.552 2.731 4.230 0.000
BrEvsNZE -0.808 1.859 -0.434 0.664
IndEvsSgE -1.832 1.921 -0.954 0.341
z.DirObjLettLength -13.012 5.230 -2.488 0.019
z.pred_min 33.271 5.242 6.347 0.000
gendermale 0.925 1.307 0.707 0.480
education_simpleno_university -0.405 1.739 -0.233 0.816
education_simplepostgraduate -0.710 1.651 -0.430 0.667
InnerCvsOuterC:z.DirObjLettLength -12.023 4.449 -2.702 0.007
BrEvsNZE:z.DirObjLettLength -3.022 3.058 -0.988 0.324
IndEvsSgE:z.DirObjLettLength -7.334 3.207 -2.287 0.023
InnerCvsOuterC:z.pred_min 2.848 4.206 0.677 0.499
BrEvsNZE:z.pred_min -3.954 2.847 -1.389 0.166
IndEvsSgE:z.pred_min 0.308 3.071 0.100 0.920
(c) Random effects
name Stand.Dev.
id: by-DirObjWordLength 7.39
id: by-CorpusPred 5.00
id: (Intercept) 6.52
item: (Intercept) 14.00
Residual 30.56

Plot the partial effect of Variety on ratings.

Code
effects::Effect(c("Variety_custom"), rating_mod0) |>
  as.data.frame() |>
  mutate(Variety_custom = factor(Variety_custom,
    levels = c("BrE", "NZE", "SgE", "IndE")
  )) |>
  ggplot(aes(Variety_custom, fit)) +
  geom_hline(yintercept = 0, size = 1, color = "grey") +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = .1) +
  # ylim(-10, 10) +
  labs(x = "", y = "Predicted rating (centered at 50)") +
  ggtitle("Variety") +
  theme_cup() +
  theme(
    axis.title.y = element_text(size = 20),
    axis.text.x = element_text(size = 20),
    axis.text.y = element_text(size = 20),
    plot.title = element_text(size = 28)
  )

Figure 5.6: Partial effects plot of VARIETY on participant ratings (with CORPUS PREDICTION = 0, DIRECT OBJECT LENGTH = 0.4, GENDER = “female”, and EDUCATION = “university”). Positive ratings indicate greater preference for the split variant (pick the book up), negative ratings indicate greater preference for the continuous variant (pick up the book).

Plot the partial effects for the Variety X Corpus prediction (Variety_custom:z.pred_min) interaction.

Code
cols <- RColorBrewer::brewer.pal(4, "Set1")

eff_df <- effects::Effect(
  focal.predictors = c("z.pred_min", "Variety_custom"),
  mod = rating_mod1,
  xlevels = 10
) |>
  as.data.frame()

nested_df <- eff_df |>
  arrange(Variety_custom) |>
  group_by(Variety_custom) |>
  nest() |>
  ungroup()

text_df <- eff_df |>
  group_by(Variety_custom) |>
  summarise(
    x = max(z.pred_min) + .05,
    y = max(fit)
  ) |>
  distinct() |>
  ungroup()

# single plot with overlapping lines
p1 <- eff_df |>
  ggplot(aes(z.pred_min, fit, color = Variety_custom)) +
  geom_line(aes(linetype = Variety_custom), size = 1) +
  labs(x = "", y = "") +
  geom_text(
    data = text_df,
    aes(x, y, label = Variety_custom, color = Variety_custom),
    family = "serif", size = 5, hjust = c(0, 0, 1, 0)
  ) +
  theme_cup() +
  scale_color_grey(guide = "none") +
  scale_linetype_discrete(guide = "none") +
  # xlim(-2.1, 3.8) +
  theme(legend.position = "top", legend.text = element_text(size = rel(1.2)))

# faceted plots with confidence intervals
p2 <- map(
  seq_along(nested_df$data),
  .f = function(i) {
    d <- nested_df$data[[i]]
    p <- d |>
      ggplot(aes(z.pred_min, fit)) +
      geom_ribbon(aes(ymin = lower, ymax = upper),
        alpha = .3, color = "grey"
      ) +
      geom_line(size = 1) +
      labs(
        x = "",
        y = ""
      ) +
      ggtitle(pull(nested_df, Variety_custom)[i]) +
      ylim(-28, 30) +
      theme_cup()
    return(p)
  }
) |>
  wrap_plots(ncol = 2)

# plot for label
ylab <- data.frame(
  label = "Predicted rating (centered at 50)",
  x = 1,
  y = 1
) |>
  ggplot() +
  geom_text(aes(x, y, label = label), angle = 90, family = "serif") +
  theme_void() +
  coord_cartesian(clip = "off")

p <- ylab + (p1 / p2 + plot_layout(heights = c(20, 30))) +
  plot_layout(widths = c(.8, 25)) +
  plot_annotation(
    caption = "Corpus model predicted log odds",
    theme = theme(plot.caption = element_text(
      hjust = 0.5, size = 12,
      margin = margin(t = 0),
      family = "serif"
    ))
  )
p

Figure 5.7: Partial effects plots of interaction of Variety and corpus model predictions on participant ratings (with DirObjLettLength = 15.5, Gender = ‘female’, and Education = ‘university’). Top plot shown to allow for easier comparison across varieties. Positive ratings indicate greater preference for the split variant (pick the book up), negative ratings indicate greater preference for the continuous variant (pick up the book).

Plot the partial effects for the Variety X Direct Object Length (Variety_custom:z.DirObjLettLength) interaction.

Code
cols <- RColorBrewer::brewer.pal(4, "Set1")

length_levs <- ratings_df |>
  pull(z.DirObjLettLength) |>
  round(3) |>
  unique() |>
  sort()

eff_df <- effects::Effect(
  focal.predictors = c("z.DirObjLettLength", "Variety_custom"),
  mod = rating_mod1,
  xlevels = list(z.DirObjLettLength = length_levs)
) |>
  as.data.frame() |>
  mutate(Variety_custom = factor(Variety_custom,
    levels = c("BrE", "NZE", "SgE", "IndE")
  ))

nested_df <- eff_df |>
  arrange(Variety_custom) |>
  group_by(Variety_custom) |>
  nest() |>
  ungroup()

text_df <- eff_df |>
  group_by(Variety_custom) |>
  summarise(
    x = ifelse(Variety_custom == "SgE",
      min(z.DirObjLettLength) - .05, max(z.DirObjLettLength) + .05
    ),
    y = ifelse(Variety_custom == "SgE", max(fit), min(fit))
  ) |>
  distinct() |>
  ungroup()

# single plot with overlapping lines
p1 <- eff_df |>
  ggplot(aes(z.DirObjLettLength, fit, color = Variety_custom)) +
  geom_line(aes(linetype = Variety_custom), size = 1) +
  labs(x = "", y = "") +
  geom_text(
    data = text_df,
    aes(x, y, label = Variety_custom, color = Variety_custom),
    family = "serif", size = 5, hjust = c(0, 0, 1, 0)
  ) +
  theme_cup() +
  scale_color_grey(guide = "none") +
  scale_linetype_discrete(guide = "none") +
  # xlim(-2.1, 3.8) +
  theme(legend.position = "top", legend.text = element_text(size = rel(1.2)))

# faceted plots with confidence intervals
p2 <- map(
  seq_along(nested_df$data),
  .f = function(i) {
    d <- nested_df$data[[i]]
    p <- d |>
      ggplot(aes(z.DirObjLettLength, fit)) +
      geom_ribbon(aes(ymin = lower, ymax = upper),
        alpha = .3, color = "grey"
      ) +
      geom_line(size = 1) +
      labs(
        x = "",
        y = ""
      ) +
      ggtitle(pull(nested_df, Variety_custom)[i]) +
      ylim(-28, 30) +
      theme_cup()
    return(p)
  }
) |>
  wrap_plots(ncol = 2)

# plot for label
ylab <- data.frame(
  label = "Predicted rating (centered at 50)",
  x = 1,
  y = 1
) |>
  ggplot() +
  geom_text(aes(x, y, label = label), angle = 90, family = "serif") +
  theme_void() +
  coord_cartesian(clip = "off")

p <- ylab + (p1 / p2 + plot_layout(heights = c(20, 30))) +
  plot_layout(widths = c(.8, 25)) +
  plot_annotation(
    caption = "Number of words in direct object\n(z-score standardized)",
    theme = theme(plot.caption = element_text(
      hjust = 0.5, size = 12,
      margin = margin(t = 0),
      family = "serif"
    ))
  )
p

Figure 5.8: Partial effects plots of interaction of Variety and Direct Object Length on participant ratings (with Corpus Prediction = −.19, Gender = ‘female’, and Education = ‘university’). Top plot shown to allow for easier comparison across varieties. Positive ratings indicate greater preference for the split variant (pick the book up), negative ratings indicate greater preference for the continuous variant (pick up the book).

7.3 Discussion

Create the tables showing the amount of agreement between participants and the corpus model.

ratings_df <- ratings_df |>
  mutate(
    Observed = str_replace_all(Response, "Discontinuous", "Split"),
    corpus_pred = ifelse(pred_new > 0, "Split", "Continuous"),
    participant_pref = ifelse(rating_c > 0, "Split", "Continuous"),
    agree = ifelse(corpus_pred == participant_pref, 1, 0),
    Corpus_model_correct = ifelse(corpus_pred == Observed, "Yes", "No")
  )

ratings_df |>
  group_by(Variety) |>
  summarise(
    Percent_agree = round(100 * mean(agree), 1)
  )

# A tibble: 4 × 2
  Variety Percent_agree
  <fct>           <dbl>
1 BrE              69.2
2 NZE              67.4
3 SgE              65.4
4 IndE             67.2
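
The agreement coding above compares the sign of the corpus model's prediction (log odds) with the sign of the centered rating; a toy illustration with hypothetical values:

```r
# Toy illustration of the agreement coding (hypothetical values):
# positive log odds and positive centered ratings both favor the split variant
pred_new <- c(0.8, -1.2, 0.3) # corpus model log odds
rating_c <- c(12, -30, -5)    # participant ratings centered at 50
corpus_pred      <- ifelse(pred_new > 0, "Split", "Continuous")
participant_pref <- ifelse(rating_c > 0, "Split", "Continuous")
mean(corpus_pred == participant_pref) # proportion agreement: 2 of 3 here
```
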
ratings_df |>
  group_by(item) |>
  summarise(
    Mean_rating = round(mean(rating_raw), 1),
    SD_rating = round(sd(rating_raw, na.rm = T), 1),
    Percent_agree = mean(agree)
  ) |>
  ungroup() |>
  arrange(desc(Percent_agree)) |>
  inner_join(ratings_df |> select(item, pred, Observed, Corpus_model_correct), by = "item") |>
  distinct(item, .keep_all = T) |>
  select(item, Mean_rating, pred, Percent_agree, Observed, Corpus_model_correct) |>
  rename("Corpus_model_prob" = "pred", Item = "item")

# A tibble: 30 × 6
   Item  Mean_rating Corpus_model_prob Percent_agree Observed  
   <chr>       <dbl>             <dbl>         <dbl> <chr>     
 1 B2            6.9            -3.39          0.992 Continuous
 2 A11          13.2            -7.30          0.940 Continuous
 3 B8           89.1             2.18          0.921 Split     
 4 A2           15.1            -3.26          0.918 Continuous
 5 B11          18.1            -6.83          0.897 Continuous
 6 A3           17.6            -1.34          0.896 Continuous
 7 B10          87.8             0.799         0.889 Split     
 8 A6           86.5             1.19          0.888 Split     
 9 A15          78.8             1.21          0.843 Split     
10 A9           80.4             1.56          0.843 Split     
# ℹ 20 more rows
# ℹ 1 more variable: Corpus_model_correct <chr>

Show Figure 7.9, the random forest partial dependence plot adapted from Szmrecsanyi et al. (2016).

Code
knitr::include_graphics("07_sz_et_al-particle_verb_rf.png")

Figure 5.9: Predicted probabilities by DIROBJLENGTH (in words), register and VARIETY obtained from random forest model predicting the split variant (adapted from Szmrecsanyi et al., 2016a, figure 3).

Appendix

6.4 Quantification of Similarity Coefficients for data subsets

Spoken data only

Just the spoken data.

data_gens_spoken <- data_genitives |>
  filter(MODE == "spoken") |>
  mutate(
    PorHeadLemma_pruned = filter_infrequent(POR_HEAD_LEMMA, 20),
    PumHeadLemma_pruned = filter_infrequent(PUM_HEAD_LEMMA, 20)
  ) |>
  droplevels()
data_gens_spoken_split <- split(data_gens_spoken, data_gens_spoken$variety_of_E)

data_dats_spoken <- data_datives |>
  filter(Mode == "spoken") |>
  mutate(
    Verb_pruned = filter_infrequent(Verb, 20),
    RecHeadPlain_pruned = filter_infrequent(RecHeadPlain, 20),
    ThemeHeadPlain_pruned = filter_infrequent(ThemeHeadPlain, 20)
  ) |>
  droplevels()
data_dats_spoken_split <- split(data_dats_spoken, data_dats_spoken$variety_of_E)

data_pv_spoken <- data_particle_verbs |>
  filter(grepl("^spok", Register)) |>
  mutate(
    VerbPart_pruned = filter_infrequent(VerbPart, 20),
    Verb_pruned = filter_infrequent(Verb, 20)
  ) |>
  droplevels()
data_pv_spoken_split <- split(data_pv_spoken, data_pv_spoken$variety_of_E)
# remove unnecessary
rm(data_gens_spoken, data_dats_spoken, data_pv_spoken)
invisible(gc(verbose = FALSE)) # free unused memory
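
filter_infrequent() is a helper defined earlier in the supplement; its behavior can be approximated as collapsing lexical levels that occur fewer than n times into a single residual level (an illustrative stand-in, not the authors' implementation):

```r
# Illustrative stand-in for the supplement's filter_infrequent()
# (hypothetical implementation): collapse levels with fewer than
# `min_n` tokens into a single "OTHER" level
filter_infrequent_sketch <- function(x, min_n = 20, other = "OTHER") {
  x <- as.character(x)
  counts <- table(x)
  keep <- names(counts)[counts >= min_n]
  factor(ifelse(x %in% keep, x, other))
}

table(filter_infrequent_sketch(c(rep("give", 25), rep("hand", 3))))
# "give" is kept (25 tokens); "hand" is collapsed to OTHER (3 tokens)
```

Pruning sparse lexical levels like this keeps the random-effect grouping factors estimable in the by-variety models.
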

Start with the GENITIVE alternation. First run the by-variety regression models.

if (!file.exists(here("model_output", "genitives_variety_glmer_spk.rds"))) {
  genitives_variety_glmer_spk <- vector("list")
  for (i in seq_along(data_gens_spoken_split)) {
    d <- data_gens_spoken_split[[i]]
    # standardize the model inputs, excluding the response and random effects
    d_std <- VADIS::stand(d, cols = gen_f4) # use the fitting function for convenience
    # fit the model
    genitives_variety_glmer_spk[[i]] <- glmer(
      gen_f4,
      data = d_std, family = binomial, control = glmer_ctrl
    )
    rm(d, d_std) # remove datasets
  }
  names(genitives_variety_glmer_spk) <- names(data_gens_spoken_split)
  saveRDS(genitives_variety_glmer_spk, here("model_output", "genitives_variety_glmer_spk.rds"))
} else {
  genitives_variety_glmer_spk <- readRDS(here("model_output", "genitives_variety_glmer_spk.rds"))
}

Check models.

round(VADIS::summary_stats(genitives_variety_glmer_spk), 3)
       N baseline predicted.corr Brier     C LogScore     AIC Max.VIF kappa
BrE  478    0.743          0.822 0.116 0.880    0.365 388.948   1.222 1.970
CanE 299    0.609          0.863 0.109 0.916    0.354 235.599   1.294 2.384
IrE  355    0.696          0.842 0.115 0.896    0.363 281.695   1.467 2.043
NZE  381    0.619          0.832 0.126 0.883    0.416 363.869   1.094 2.129
HKE  505    0.703          0.820 0.125 0.879    0.386 426.114   1.344 2.132
IndE 453    0.859          0.914 0.065 0.920    0.223 241.338   1.259 1.701
JamE 412    0.813          0.859 0.096 0.867    0.316 284.277   1.355 2.047
PhlE 545    0.844          0.890 0.079 0.886    0.266 334.841   1.172 1.758
SgE  416    0.680          0.844 0.108 0.915    0.343 328.757   1.247 2.305
     HosLem.p
BrE     0.190
CanE    0.217
IrE     0.589
NZE     0.034
HKE     0.754
IndE    0.591
JamE    0.182
PhlE    0.687
SgE     0.633

Now fit a random forest for each variety.

if (file.exists(here("model_output", "gen_crf_vadis_list_spk.rds"))) {
  genitives_variety_forests_vadis_spk <- readRDS(
    here("model_output", "gen_crf_vadis_list_spk.rds")
  )
} else {
  genitives_variety_forests_vadis_spk <- lapply(
    data_gens_spoken_split, function(x) fit_party_crf(gen_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
  )
  saveRDS(genitives_variety_forests_vadis_spk, here("model_output", "gen_crf_vadis_list_spk.rds"))
}

Run the VADIS analyses for lines 1, 2 and 3.

gen_signif_line <- vadis_line1(genitives_variety_glmer_spk,
  path = here("model_output", "gens_line1_spk.rds"),
  overwrite = "reload"
)
gen_coef_line <- vadis_line2(genitives_variety_glmer_spk,
  path = here("model_output", "gens_line2_spk.rds"),
  overwrite = "reload"
)
gen_varimp_line <- vadis_line3(genitives_variety_forests_vadis_spk,
  path = here("model_output", "gens_line3_spk.rds"),
  conditional = FALSE,
  overwrite = "reload"
)

Now look at the mean values by line and variety.

gen_mean_sims <- data.frame(
  line1 = gen_signif_line$similarity.scores[, 1], # get only the values in the first column
  line2 = gen_coef_line$similarity.scores[, 1],
  line3 = gen_varimp_line$similarity.scores[, 1]
) %>%
  add_row(!!!colMeans(.))
gen_mean_sims$Variety_Mean <- rowMeans(gen_mean_sims)
rownames(gen_mean_sims) <- c(names(data_gens_spoken_split), "Line Mean")
round(gen_mean_sims, 3)
          line1 line2 line3 Variety_Mean
BrE       0.722 0.433 0.574        0.576
CanE      0.694 0.477 0.557        0.576
IrE       0.736 0.479 0.625        0.613
NZE       0.667 0.425 0.560        0.550
HKE       0.806 0.183 0.211        0.400
IndE      0.625 0.298 0.461        0.462
JamE      0.736 0.444 0.661        0.614
PhlE      0.736 0.403 0.438        0.525
SgE       0.722 0.493 0.586        0.601
Line Mean 0.716 0.404 0.519        0.546
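
The add_row(!!!colMeans(.)) idiom above splices the column means into the data frame as a final row; a minimal demonstration with toy values:

```r
library(tibble) # for add_row()

# add_row(!!!colMeans(df)) splices the named vector of column means
# in as an extra row (toy values for illustration)
df <- data.frame(line1 = c(0.7, 0.9), line2 = c(0.4, 0.6))
df2 <- add_row(df, !!!colMeans(df))
df2 # the third row holds the column means, 0.8 and 0.5
```

The !!! operator unpacks the named vector returned by colMeans() into name-value pairs, so each column mean lands in its matching column.
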

Next the DATIVE alternation.

if (!file.exists(here("model_output", "datives_variety_glmer_spk.rds"))) {
  datives_variety_glmer_spk <- vector("list")
  for (i in seq_along(data_dats_spoken_split)) {
    d <- data_dats_spoken_split[[i]]
    # standardize the model inputs, excluding the response and random effects
    d_std <- VADIS::stand(d, cols = dat_f4) # use the fitting function for convenience
    # fit the model
    datives_variety_glmer_spk[[i]] <- glmer(
      dat_f4,
      data = d_std, family = binomial, control = glmer_ctrl
    )
    rm(d, d_std) # remove datasets
  }
  names(datives_variety_glmer_spk) <- names(data_dats_spoken_split)
  saveRDS(datives_variety_glmer_spk, here("model_output", "datives_variety_glmer_spk.rds"))
} else {
  datives_variety_glmer_spk <- readRDS(here("model_output", "datives_variety_glmer_spk.rds"))
}

Check models.

round(VADIS::summary_stats(datives_variety_glmer_spk), 3)
       N baseline predicted.corr Brier     C LogScore     AIC Max.VIF kappa
BrE  523    0.744          0.922 0.058 0.957    0.212 279.934   2.512 4.442
CanE 570    0.767          0.911 0.059 0.969    0.189 271.397   2.517 4.022
IrE  521    0.770          0.914 0.061 0.967    0.199 261.081   2.476 4.121
NZE  580    0.734          0.890 0.080 0.947    0.257 354.169   3.004 4.238
HKE  710    0.699          0.911 0.062 0.968    0.204 366.309   2.303 4.218
IndE 643    0.577          0.925 0.061 0.967    0.217 340.888   2.231 4.595
JamE 615    0.792          0.967 0.028 0.987    0.103 188.365   2.485 4.408
PhlE 622    0.711          0.916 0.064 0.961    0.218 328.455   2.949 4.568
SgE  690    0.757          0.919 0.065 0.956    0.223 371.941   2.773 4.228
     HosLem.p
BrE     0.329
CanE    0.909
IrE     0.947
NZE     0.496
HKE     0.969
IndE    0.521
JamE    0.000
PhlE    0.790
SgE     0.688

Now fit a random forest for each variety.

if (file.exists(here("model_output", "dat_crf_vadis_list_spk.rds"))) {
  datives_variety_forests_vadis_spk <- readRDS(
    here("model_output", "dat_crf_vadis_list_spk.rds")
  )
} else {
  datives_variety_forests_vadis_spk <- lapply(
    data_dats_spoken_split, function(x) fit_party_crf(dat_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
  )
  saveRDS(datives_variety_forests_vadis_spk, here("model_output", "dat_crf_vadis_list_spk.rds"))
}

Run the VADIS analyses for lines 1, 2 and 3.

dat_signif_line <- vadis_line1(datives_variety_glmer_spk,
  path = here("model_output", "dats_line1_spk.rds"),
  overwrite = "reload"
)
dat_coef_line <- vadis_line2(datives_variety_glmer_spk,
  path = here("model_output", "dats_line2_spk.rds"),
  overwrite = "reload"
)
dat_varimp_line <- vadis_line3(datives_variety_forests_vadis_spk,
  path = here("model_output", "dats_line3_spk.rds"),
  conditional = FALSE,
  overwrite = "reload"
)

Now look at the mean values by line and variety.

dat_mean_sims <- data.frame(
  line1 = dat_signif_line$similarity.scores[, 1], # get only the values in the first column
  line2 = dat_coef_line$similarity.scores[, 1],
  line3 = dat_varimp_line$similarity.scores[, 1]
) %>%
  add_row(!!!colMeans(.))
dat_mean_sims$Variety_Mean <- rowMeans(dat_mean_sims)
rownames(dat_mean_sims) <- c(names(data_dats_spoken_split), "Line Mean")
round(dat_mean_sims, 3)
          line1  line2 line3 Variety_Mean
BrE       0.797  0.603 0.506        0.635
CanE      0.703  0.511 0.714        0.643
IrE       0.797  0.565 0.649        0.670
NZE       0.641  0.573 0.640        0.618
HKE       0.781  0.565 0.545        0.630
IndE      0.672  0.509 0.628        0.603
JamE      0.781 -0.058 0.637        0.453
PhlE      0.594  0.570 0.646        0.603
SgE       0.734  0.556 0.536        0.609
Line Mean 0.722  0.488 0.611        0.607

Finally, the PARTICLE PLACEMENT alternation.

if (!file.exists(here("model_output", "pv_variety_glmer_spk.rds"))) {
  pv_variety_glmer_spk <- vector("list")
  for (i in seq_along(data_pv_spoken_split)) {
    d <- data_pv_spoken_split[[i]]
    # standardize the model inputs, excluding the response and random effects
    d_std <- VADIS::stand(d, cols = pv_f4) # use the fitting function for convenience
    # fit the model
    pv_variety_glmer_spk[[i]] <- glmer(
      pv_f4,
      data = d_std, family = binomial, control = glmer_ctrl
    )
    rm(d, d_std) # remove datasets
  }
  names(pv_variety_glmer_spk) <- names(data_pv_spoken_split)
  saveRDS(pv_variety_glmer_spk, here("model_output", "pv_variety_glmer_spk.rds"))
} else {
  pv_variety_glmer_spk <- readRDS(here("model_output", "pv_variety_glmer_spk.rds"))
}

Check models.

round(VADIS::summary_stats(pv_variety_glmer_spk), 3)
       N baseline predicted.corr Brier     C LogScore     AIC Max.VIF kappa
BrE  573    0.557          0.787 0.146 0.870    0.446 595.582   1.196 1.799
CanE 669    0.556          0.806 0.135 0.887    0.421 646.380   1.160 1.897
IrE  674    0.614          0.773 0.156 0.843    0.472 724.131   1.226 1.838
NZE  713    0.560          0.827 0.122 0.906    0.384 659.498   1.188 1.744
HKE  493    0.811          0.870 0.098 0.886    0.314 362.358   1.216 1.853
IndE 409    0.866          0.914 0.062 0.928    0.211 229.046   1.352 2.158
JamE 478    0.808          0.872 0.092 0.897    0.299 363.115   1.312 1.886
PhlE 366    0.839          0.902 0.072 0.922    0.238 224.632   1.370 1.885
SgE  527    0.793          0.867 0.093 0.907    0.295 368.461   1.270 1.874
     HosLem.p
BrE     0.811
CanE    0.489
IrE     0.764
NZE     0.104
HKE     0.761
IndE    0.914
JamE    0.401
PhlE    0.940
SgE     0.797

Now fit a random forest for each variety.

if (file.exists(here("model_output", "pv_crf_vadis_list_spk.rds"))) {
  pv_variety_forests_vadis_spk <- readRDS(
    here("model_output", "pv_crf_vadis_list_spk.rds")
  )
} else {
  pv_variety_forests_vadis_spk <- lapply(
    data_pv_spoken_split, function(x) fit_party_crf(pv_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
  )
  saveRDS(pv_variety_forests_vadis_spk, here("model_output", "pv_crf_vadis_list_spk.rds"))
}

Run the VADIS analyses for lines 1, 2 and 3.

pv_signif_line <- vadis_line1(pv_variety_glmer_spk,
  path = here("model_output", "pv_line1_spk.rds"),
  overwrite = "reload"
)
pv_coef_line <- vadis_line2(pv_variety_glmer_spk,
  path = here("model_output", "pv_line2_spk.rds"),
  overwrite = "reload"
)
pv_varimp_line <- vadis_line3(pv_variety_forests_vadis_spk,
  path = here("model_output", "pv_line3_spk.rds"),
  conditional = FALSE,
  overwrite = "reload"
)

Now look at the mean values by line and variety.

pv_mean_sims <- data.frame(
  line1 = pv_signif_line$similarity.scores[, 1], # get only the values in the first column
  line2 = pv_coef_line$similarity.scores[, 1],
  line3 = pv_varimp_line$similarity.scores[, 1]
) %>%
  add_row(!!!colMeans(.))
pv_mean_sims$Variety_Mean <- rowMeans(pv_mean_sims)
rownames(pv_mean_sims) <- c(names(data_pv_spoken_split), "Line Mean")
round(pv_mean_sims, 3)
          line1 line2 line3 Variety_Mean
BrE       0.734 0.726 0.765        0.742
CanE      0.734 0.688 0.795        0.739
IrE       0.719 0.716 0.798        0.744
NZE       0.672 0.705 0.771        0.716
HKE       0.703 0.731 0.804        0.746
IndE      0.531 0.353 0.503        0.462
JamE      0.766 0.687 0.643        0.699
PhlE      0.719 0.645 0.750        0.705
SgE       0.734 0.678 0.708        0.707
Line Mean 0.701 0.659 0.726        0.695

Now create the summary table.

tab <- data.frame(
  Genitives = unlist(gen_mean_sims[10, ]),
  Datives = unlist(dat_mean_sims[10, ]),
  `Particle placement` = unlist(pv_mean_sims[10, ])
) |>
  round(2)
rownames(tab) <- c("1st line", "2nd line", "3rd line", "mean")
tab

         Genitives Datives Particle.placement
1st line      0.72    0.72               0.70
2nd line      0.40    0.49               0.66
3rd line      0.52    0.61               0.73
mean          0.55    0.61               0.70

Core Grammar Score for spoken data only:

round(rowMeans(tab[4, ]), 3)
mean 
0.62 
# clean up
rm(genitives_variety_forests_vadis_spk, datives_variety_forests_vadis_spk, pv_variety_forests_vadis_spk)
invisible(gc(verbose = FALSE)) # free unused memory

Written data only

Just the written data.

data_gens_written <- data_genitives |>
  filter(MODE == "written") |>
  mutate(
    PorHeadLemma_pruned = filter_infrequent(POR_HEAD_LEMMA, 20),
    PumHeadLemma_pruned = filter_infrequent(PUM_HEAD_LEMMA, 20)
  ) |>
  droplevels()
data_gens_written_split <- split(data_gens_written, data_gens_written$variety_of_E)

data_dats_written <- data_datives |>
  filter(Mode == "written") |>
  mutate(
    Verb_pruned = filter_infrequent(Verb, 20),
    RecHeadPlain_pruned = filter_infrequent(RecHeadPlain, 20),
    ThemeHeadPlain_pruned = filter_infrequent(ThemeHeadPlain, 20)
  ) |>
  droplevels()
data_dats_written_split <- split(data_dats_written, data_dats_written$variety_of_E)

data_pv_written <- data_particle_verbs |>
  filter(!grepl("^spok", Register)) |> # keep written (non-spoken) registers
  mutate(
    VerbPart_pruned = filter_infrequent(VerbPart, 20),
    Verb_pruned = filter_infrequent(Verb, 20)
  ) |>
  droplevels()
data_pv_written_split <- split(data_pv_written, data_pv_written$variety_of_E)
# remove unnecessary
rm(data_gens_written, data_dats_written, data_pv_written)
invisible(gc(verbose = FALSE)) # free unused memory

Start with the GENITIVE alternation. First run the by-variety regression models.

if (!file.exists(here("model_output", "genitives_variety_glmer_wrt.rds"))) {
  genitives_variety_glmer_wrt <- vector("list")
  for (i in seq_along(data_gens_written_split)) {
    d <- data_gens_written_split[[i]]
    # standardize the model inputs, excluding the response and random effects
    d_std <- VADIS::stand(d, cols = gen_f4) # use the fitting function for convenience
    # fit the model
    genitives_variety_glmer_wrt[[i]] <- glmer(
      gen_f4,
      data = d_std, family = binomial, control = glmer_ctrl
    )
    rm(d, d_std) # remove datasets
  }
  names(genitives_variety_glmer_wrt) <- names(data_gens_written_split)
  saveRDS(genitives_variety_glmer_wrt, here("model_output", "genitives_variety_glmer_wrt.rds"))
} else {
  genitives_variety_glmer_wrt <- readRDS(here("model_output", "genitives_variety_glmer_wrt.rds"))
}
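The fit-or-reload caching pattern above (fit the models once, save them as `.rds`, and reload the saved object on later runs) recurs for every alternation and data subset in this section. It could be factored into a small helper like the following (a hypothetical convenience wrapper, not part of the {VADIS} package):

```r
# Hypothetical helper: run an expensive fitting function once and cache the
# result as an .rds file; later calls reload the cached object instead.
fit_or_reload <- function(path, fit_fun) {
  if (file.exists(path)) {
    readRDS(path)
  } else {
    obj <- fit_fun()
    saveRDS(obj, path) # cache for the next run
    obj
  }
}

# usage sketch (names as in this section):
# genitives_variety_glmer_wrt <- fit_or_reload(
#   here("model_output", "genitives_variety_glmer_wrt.rds"),
#   function() lapply(data_gens_written_split, fit_one_variety) # fit_one_variety is hypothetical
# )
```

Centralizing the pattern also guarantees that the existence check, the save, and the reload all use the same file path.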

Check models.

round(VADIS::summary_stats(genitives_variety_glmer_wrt), 3)
        N baseline predicted.corr Brier     C LogScore      AIC Max.VIF kappa
BrE  1134    0.723          0.863 0.099 0.915    0.321  798.379   1.292 1.979
CanE 1061    0.618          0.846 0.112 0.916    0.352  806.000   1.281 2.225
IrE   958    0.706          0.849 0.113 0.898    0.355  735.456   1.149 1.981
NZE  1486    0.713          0.861 0.103 0.906    0.332 1076.406   1.166 1.917
HKE  1024    0.668          0.857 0.096 0.935    0.305  708.852   1.125 2.159
IndE 1383    0.717          0.863 0.098 0.918    0.317  966.937   1.139 2.000
JamE  895    0.753          0.884 0.086 0.929    0.280  579.985   1.283 1.979
PhlE 1204    0.663          0.826 0.120 0.899    0.375  968.461   1.100 2.022
SgE   809    0.644          0.852 0.112 0.912    0.359  666.935   1.355 2.420
     HosLem.p
BrE     0.021
CanE    0.619
IrE     0.569
NZE     0.002
HKE     0.422
IndE    0.628
JamE    0.759
PhlE    0.679
SgE     0.268

Now fit a random forest for each variety.

if (file.exists(here("model_output", "gen_crf_vadis_list_wrt.rds"))) {
  genitives_variety_forests_vadis_wrt <- readRDS(
    here("model_output", "gen_crf_vadis_list_wrt.rds")
  )
} else {
  genitives_variety_forests_vadis_wrt <- lapply(
    data_gens_written_split, function(x) fit_party_crf(gen_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
  )
  saveRDS(genitives_variety_forests_vadis_wrt, here("model_output", "gen_crf_vadis_list_wrt.rds"))
}

Run the VADIS analyses for lines 1, 2 and 3.

gen_signif_line <- vadis_line1(genitives_variety_glmer_wrt,
  path = here("model_output", "gens_line1_wrt.rds"),
  overwrite = "reload"
)
gen_coef_line <- vadis_line2(genitives_variety_glmer_wrt,
  path = here("model_output", "gens_line2_wrt.rds"),
  overwrite = "reload"
)
gen_varimp_line <- vadis_line3(genitives_variety_forests_vadis_wrt,
  path = here("model_output", "gens_line3_wrt.rds"),
  conditional = FALSE,
  overwrite = "reload"
)
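Each `vadis_line*()` call returns, among other things, a between-variety distance matrix and per-variety similarity scores. As a rough illustration of how such scores relate to distances (an assumed simplification for exposition, not the package's exact computation), a variety's similarity can be read as 1 minus its mean normalized distance to the other varieties:

```r
# Toy illustration (assumed simplification of the VADIS similarity scores):
# given a normalized between-variety distance matrix, read a variety's
# similarity as 1 minus its mean distance to the other varieties.
d <- matrix(c(0.0, 0.2, 0.4,
              0.2, 0.0, 0.3,
              0.4, 0.3, 0.0),
            nrow = 3,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
# divide by (n - 1) other varieties; the self-distance on the diagonal is 0
sim_scores <- 1 - rowSums(d) / (ncol(d) - 1)
round(sim_scores, 3)
```

On this reading, a score near 1 means the variety patterns much like the others on that line of evidence, and a low score flags an outlier.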

Now look at the mean values by line and variety.

gen_mean_sims <- data.frame(
  line1 = gen_signif_line$similarity.scores[, 1], # take only the first column of values
  line2 = gen_coef_line$similarity.scores[, 1],
  line3 = gen_varimp_line$similarity.scores[, 1]
) %>%
  add_row(!!!colMeans(.))
gen_mean_sims$Variety_Mean <- rowMeans(gen_mean_sims)
rownames(gen_mean_sims) <- c(names(data_gens_written_split), "Line Mean")
round(gen_mean_sims, 3)
          line1 line2 line3 Variety_Mean
BrE       0.903 0.627 0.836        0.789
CanE      0.833 0.687 0.872        0.797
IrE       0.833 0.717 0.827        0.793
NZE       0.861 0.694 0.762        0.772
HKE       0.833 0.653 0.818        0.768
IndE      0.792 0.630 0.756        0.726
JamE      0.903 0.561 0.804        0.756
PhlE      0.861 0.540 0.789        0.730
SgE       0.903 0.660 0.857        0.806
Line Mean 0.858 0.641 0.813        0.771
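The `add_row(!!!colMeans(.))` idiom above appends a row of column means: `colMeans()` returns a named vector, and the `!!!` (splice) operator from {rlang} expands it into one `add_row()` argument per column. A toy example with made-up numbers:

```r
library(dplyr)  # for %>%
library(tibble) # for add_row(); !!! splicing comes via {rlang}

sims <- data.frame(line1 = c(0.9, 0.8), line2 = c(0.6, 0.7))
# append a row of column means: each element of the named vector from
# colMeans() becomes one column argument to add_row()
sims <- sims %>% add_row(!!!colMeans(.))
# then average across the lines, before the new column is part of the frame
sims$Variety_Mean <- rowMeans(sims)
round(sims, 2)
```

Note that `rowMeans()` is applied before `Variety_Mean` exists as a column, so the overall mean is not contaminated by itself.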

Next, the DATIVE alternation.

if (!file.exists(here("model_output", "datives_variety_glmer_wrt.rds"))) {
  datives_variety_glmer_wrt <- vector("list")
  for (i in seq_along(data_dats_written_split)) {
    d <- data_dats_written_split[[i]]
    # standardize the model inputs, excluding the response and random effects
    d_std <- VADIS::stand(d, cols = dat_f4) # the model formula selects the columns
    # fit the model
    datives_variety_glmer_wrt[[i]] <- glmer(
      dat_f4,
      data = d_std, family = binomial, control = glmer_ctrl
    )
    rm(d, d_std) # remove datasets
  }
  names(datives_variety_glmer_wrt) <- names(data_dats_written_split)
  saveRDS(datives_variety_glmer_wrt, here("model_output", "datives_variety_glmer_wrt.rds"))
} else {
  datives_variety_glmer_wrt <- readRDS(here("model_output", "datives_variety_glmer_wrt.rds"))
}

Check models.

round(VADIS::summary_stats(datives_variety_glmer_wrt), 3)
        N baseline predicted.corr Brier     C LogScore     AIC Max.VIF kappa
BrE   795    0.683          0.870 0.085 0.946    0.274 521.633   2.346 3.882
CanE  787    0.675          0.892 0.077 0.958    0.246 476.893   1.764 3.645
IrE   749    0.698          0.877 0.085 0.942    0.280 504.271   1.974 4.051
NZE   905    0.696          0.884 0.083 0.943    0.273 585.645   1.771 3.581
HKE  1058    0.598          0.868 0.091 0.943    0.297 731.952   1.927 3.787
IndE  907    0.596          0.885 0.085 0.950    0.279 611.356   2.002 4.092
JamE  760    0.645          0.893 0.084 0.951    0.274 492.324   2.330 3.948
PhlE  883    0.626          0.905 0.071 0.965    0.231 491.132   1.905 3.931
SgE   853    0.671          0.869 0.090 0.945    0.282 554.641   2.278 3.897
     HosLem.p
BrE     0.414
CanE    0.754
IrE     0.627
NZE     0.454
HKE     0.014
IndE    0.042
JamE    0.921
PhlE    0.889
SgE     0.173

Now fit a random forest for each variety.

if (file.exists(here("model_output", "dat_crf_vadis_list_wrt.rds"))) {
  datives_variety_forests_vadis_wrt <- readRDS(
    here("model_output", "dat_crf_vadis_list_wrt.rds")
  )
} else {
  datives_variety_forests_vadis_wrt <- lapply(
    data_dats_written_split, function(x) fit_party_crf(dat_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
  )
  saveRDS(datives_variety_forests_vadis_wrt, here("model_output", "dat_crf_vadis_list_wrt.rds"))
}

Run the VADIS analyses for lines 1, 2 and 3.

dat_signif_line <- vadis_line1(datives_variety_glmer_wrt,
  path = here("model_output", "dats_line1_wrt.rds"),
  overwrite = "reload"
)
dat_coef_line <- vadis_line2(datives_variety_glmer_wrt,
  path = here("model_output", "dats_line2_wrt.rds"),
  overwrite = "reload"
)
dat_varimp_line <- vadis_line3(datives_variety_forests_vadis_wrt,
  path = here("model_output", "dats_line3_wrt.rds"),
  conditional = FALSE,
  overwrite = "reload"
)

Now look at the mean values by line and variety.

dat_mean_sims <- data.frame(
  line1 = dat_signif_line$similarity.scores[, 1], # take only the first column of values
  line2 = dat_coef_line$similarity.scores[, 1],
  line3 = dat_varimp_line$similarity.scores[, 1]
) %>%
  add_row(!!!colMeans(.))
dat_mean_sims$Variety_Mean <- rowMeans(dat_mean_sims)
rownames(dat_mean_sims) <- c(names(data_dats_written_split), "Line Mean")
round(dat_mean_sims, 3)
          line1 line2 line3 Variety_Mean
BrE       0.781 0.647 0.938        0.789
CanE      0.766 0.730 0.917        0.804
IrE       0.766 0.698 0.938        0.800
NZE       0.641 0.729 0.842        0.737
HKE       0.688 0.736 0.905        0.776
IndE      0.719 0.736 0.920        0.792
JamE      0.703 0.648 0.890        0.747
PhlE      0.531 0.571 0.815        0.639
SgE       0.781 0.738 0.920        0.813
Line Mean 0.708 0.693 0.898        0.766

Finally, the PARTICLE PLACEMENT alternation.

if (!file.exists(here("model_output", "pv_variety_glmer_wrt.rds"))) {
  pv_variety_glmer_wrt <- vector("list")
  for (i in seq_along(data_pv_written_split)) {
    d <- data_pv_written_split[[i]]
    # standardize the model inputs, excluding the response and random effects
    d_std <- VADIS::stand(d, cols = pv_f4) # the model formula selects the columns
    # fit the model
    pv_variety_glmer_wrt[[i]] <- glmer(
      pv_f4,
      data = d_std, family = binomial, control = glmer_ctrl
    )
    rm(d, d_std) # remove datasets
  }
  names(pv_variety_glmer_wrt) <- names(data_pv_written_split)
  saveRDS(pv_variety_glmer_wrt, here("model_output", "pv_variety_glmer_wrt.rds"))
} else {
  pv_variety_glmer_wrt <- readRDS(here("model_output", "pv_variety_glmer_wrt.rds"))
}

Check models.

round(VADIS::summary_stats(pv_variety_glmer_wrt), 3)
       N baseline predicted.corr Brier     C LogScore     AIC Max.VIF kappa
BrE  573    0.557          0.787 0.146 0.870    0.446 595.582   1.196 1.799
CanE 669    0.556          0.806 0.135 0.887    0.421 646.380   1.160 1.897
IrE  674    0.614          0.773 0.156 0.843    0.472 724.131   1.226 1.838
NZE  713    0.560          0.827 0.122 0.906    0.384 659.498   1.188 1.744
HKE  493    0.811          0.870 0.098 0.886    0.314 362.358   1.216 1.853
IndE 409    0.866          0.914 0.062 0.928    0.211 229.046   1.352 2.158
JamE 478    0.808          0.872 0.092 0.897    0.299 363.115   1.312 1.886
PhlE 366    0.839          0.902 0.072 0.922    0.238 224.632   1.370 1.885
SgE  527    0.793          0.867 0.093 0.907    0.295 368.461   1.270 1.874
     HosLem.p
BrE     0.811
CanE    0.489
IrE     0.764
NZE     0.104
HKE     0.761
IndE    0.914
JamE    0.401
PhlE    0.940
SgE     0.797

Now fit a random forest for each variety.

if (file.exists(here("model_output", "pv_crf_vadis_list_wrt.rds"))) {
  pv_variety_forests_vadis_wrt <- readRDS(
    here("model_output", "pv_crf_vadis_list_wrt.rds")
  )
} else {
  pv_variety_forests_vadis_wrt <- lapply(
    data_pv_written_split, function(x) fit_party_crf(pv_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
  )
  saveRDS(pv_variety_forests_vadis_wrt, here("model_output", "pv_crf_vadis_list_wrt.rds"))
}

Run the VADIS analyses for lines 1, 2 and 3.

pv_signif_line <- vadis_line1(pv_variety_glmer_wrt,
  path = here("model_output", "pv_line1_wrt.rds"),
  overwrite = "reload"
)
pv_coef_line <- vadis_line2(pv_variety_glmer_wrt,
  path = here("model_output", "pv_line2_wrt.rds"),
  overwrite = "reload"
)
pv_varimp_line <- vadis_line3(pv_variety_forests_vadis_wrt,
  path = here("model_output", "pv_line3_wrt.rds"),
  conditional = FALSE,
  overwrite = "reload"
)

Now look at the mean values by line and variety.

pv_mean_sims <- data.frame(
  line1 = pv_signif_line$similarity.scores[, 1], # take only the first column of values
  line2 = pv_coef_line$similarity.scores[, 1],
  line3 = pv_varimp_line$similarity.scores[, 1]
) %>%
  add_row(!!!colMeans(.))
pv_mean_sims$Variety_Mean <- rowMeans(pv_mean_sims)
rownames(pv_mean_sims) <- c(names(data_pv_written_split), "Line Mean")
round(pv_mean_sims, 3)
          line1 line2 line3 Variety_Mean
BrE       0.734 0.726 0.762        0.741
CanE      0.734 0.688 0.771        0.731
IrE       0.719 0.716 0.789        0.741
NZE       0.672 0.705 0.777        0.718
HKE       0.703 0.731 0.762        0.732
IndE      0.531 0.353 0.539        0.474
JamE      0.766 0.687 0.616        0.690
PhlE      0.719 0.645 0.732        0.699
SgE       0.734 0.678 0.711        0.708
Line Mean 0.701 0.659 0.718        0.693

Create summary table.

tab <- data.frame(
  Genitives = unlist(gen_mean_sims[10, ]),
  Datives = unlist(dat_mean_sims[10, ]),
  `Particle placement` = unlist(pv_mean_sims[10, ])
) |>
  round(2)
rownames(tab) <- c("1st line", "2nd line", "3rd line", "mean")
tab


         Genitives Datives Particle.placement
1st line      0.86    0.71               0.70
2nd line      0.64    0.69               0.66
3rd line      0.81    0.90               0.72
mean          0.77    0.77               0.69

Core Grammar Score for written data only:

round(rowMeans(tab[4, ]), 3)
 mean 
0.743 
# clean up
rm(genitives_variety_forests_vadis_wrt, datives_variety_forests_vadis_wrt, pv_variety_forests_vadis_wrt)
invisible(gc(verbose = FALSE)) # free unused memory

Inner Circle only

Just the Inner Circle data.

data_gens_inner <- data_genitives |>
  filter(Variety %in% c("gb", "ire", "nz", "can")) |>
  mutate(
    PorHeadLemma_pruned = filter_infrequent(POR_HEAD_LEMMA, 20),
    PumHeadLemma_pruned = filter_infrequent(PUM_HEAD_LEMMA, 20)
  ) |>
  droplevels()
data_gens_inner_split <- split(data_gens_inner, data_gens_inner$variety_of_E)

data_dats_inner <- data_datives |>
  filter(Variety %in% c("GB", "CAN", "NZ", "IRE")) |>
  mutate(
    Verb_pruned = filter_infrequent(Verb, 20),
    RecHeadPlain_pruned = filter_infrequent(RecHeadPlain, 20),
    ThemeHeadPlain_pruned = filter_infrequent(ThemeHeadPlain, 20)
  ) |>
  droplevels()
data_dats_inner_split <- split(data_dats_inner, data_dats_inner$variety_of_E)

data_pv_inner <- data_particle_verbs |>
  filter(Variety %in% c("GB", "CA", "NZ", "IE")) |>
  mutate(
    VerbPart_pruned = filter_infrequent(VerbPart, 20),
    Verb_pruned = filter_infrequent(Verb, 20)
  ) |>
  droplevels()
data_pv_inner_split <- split(data_pv_inner, data_pv_inner$variety_of_E)
# remove unnecessary
rm(data_gens_inner, data_dats_inner, data_pv_inner)
invisible(gc(verbose = FALSE)) # free unused memory

Start with the GENITIVE alternation. First run the by-variety regression models.

if (!file.exists(here("model_output", "genitives_variety_glmer_inner.rds"))) {
  genitives_variety_glmer_inner <- vector("list")
  for (i in seq_along(data_gens_inner_split)) {
    d <- data_gens_inner_split[[i]]
    # standardize the model inputs, excluding the response and random effects
    d_std <- VADIS::stand(d, cols = gen_f4) # the model formula selects the columns
    # fit the model
    genitives_variety_glmer_inner[[i]] <- glmer(
      gen_f4,
      data = d_std, family = binomial, control = glmer_ctrl
    )
    rm(d, d_std) # remove datasets
  }
  names(genitives_variety_glmer_inner) <- names(data_gens_inner_split)
  saveRDS(genitives_variety_glmer_inner, here("model_output", "genitives_variety_glmer_inner.rds"))
} else {
  genitives_variety_glmer_inner <- readRDS(here("model_output", "genitives_variety_glmer_inner.rds"))
}

Check models.

round(VADIS::summary_stats(genitives_variety_glmer_inner), 3)
        N baseline predicted.corr Brier     C LogScore      AIC Max.VIF kappa
BrE  1612    0.729          0.848 0.107 0.901    0.342 1173.156   1.206 1.958
CanE 1360    0.616          0.846 0.113 0.912    0.358 1028.416   1.232 2.243
IrE  1313    0.703          0.836 0.114 0.899    0.359 1007.834   1.168 1.978
NZE  1867    0.694          0.844 0.112 0.895    0.361 1434.009   1.119 1.940
     HosLem.p
BrE     0.073
CanE    0.586
IrE     0.360
NZE     0.079

Now fit a random forest for each variety.

if (file.exists(here("model_output", "gen_crf_vadis_list_inner.rds"))) {
  genitives_variety_forests_vadis_inner <- readRDS(
    here("model_output", "gen_crf_vadis_list_inner.rds")
  )
} else {
  genitives_variety_forests_vadis_inner <- lapply(
    data_gens_inner_split, function(x) fit_party_crf(gen_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
  )
  saveRDS(genitives_variety_forests_vadis_inner, here("model_output", "gen_crf_vadis_list_inner.rds"))
}

Run the VADIS analyses for lines 1, 2 and 3.

gen_signif_line <- vadis_line1(genitives_variety_glmer_inner,
  path = here("model_output", "gens_line1_inner.rds"),
  overwrite = "reload"
)
gen_coef_line <- vadis_line2(genitives_variety_glmer_inner,
  path = here("model_output", "gens_line2_inner.rds"),
  overwrite = "reload"
)
gen_varimp_line <- vadis_line3(genitives_variety_forests_vadis_inner,
  path = here("model_output", "gens_line3_inner.rds"),
  conditional = FALSE,
  overwrite = "reload"
)

Now look at the mean values by line and variety.

gen_mean_sims <- data.frame(
  line1 = gen_signif_line$similarity.scores[, 1], # take only the first column of values
  line2 = gen_coef_line$similarity.scores[, 1],
  line3 = gen_varimp_line$similarity.scores[, 1]
) %>%
  add_row(!!!colMeans(.))
gen_mean_sims$Variety_Mean <- rowMeans(gen_mean_sims)
rownames(gen_mean_sims) <- c(names(data_gens_inner_split), "Line Mean")
round(gen_mean_sims, 3)
          line1 line2 line3 Variety_Mean
BrE       0.926 0.694 0.913        0.844
CanE      0.926 0.753 0.960        0.880
IrE       0.926 0.768 0.960        0.885
NZE       0.926 0.750 0.960        0.879
Line Mean 0.926 0.741 0.948        0.872

Next, the DATIVE alternation.

if (!file.exists(here("model_output", "datives_variety_glmer_inner.rds"))) {
  datives_variety_glmer_inner <- vector("list")
  for (i in seq_along(data_dats_inner_split)) {
    d <- data_dats_inner_split[[i]]
    # standardize the model inputs, excluding the response and random effects
    d_std <- VADIS::stand(d, cols = dat_f4) # the model formula selects the columns
    # fit the model
    datives_variety_glmer_inner[[i]] <- glmer(
      dat_f4,
      data = d_std, family = binomial, control = glmer_ctrl
    )
    rm(d, d_std) # remove datasets
  }
  names(datives_variety_glmer_inner) <- names(data_dats_inner_split)
  saveRDS(datives_variety_glmer_inner, here("model_output", "datives_variety_glmer_inner.rds"))
} else {
  datives_variety_glmer_inner <- readRDS(here("model_output", "datives_variety_glmer_inner.rds"))
}

Check models.

round(VADIS::summary_stats(datives_variety_glmer_inner), 3)
        N baseline predicted.corr Brier     C LogScore     AIC Max.VIF kappa
BrE  1318    0.707          0.888 0.080 0.946    0.264 780.790   2.529 4.009
CanE 1357    0.713          0.898 0.073 0.959    0.234 728.641   2.110 3.759
IrE  1270    0.728          0.890 0.078 0.949    0.257 740.895   2.158 4.025
NZE  1485    0.711          0.890 0.083 0.943    0.271 897.885   2.211 3.747
     HosLem.p
BrE     0.676
CanE    0.803
IrE     0.693
NZE     0.030

Now fit a random forest for each variety.

if (file.exists(here("model_output", "dat_crf_vadis_list_inner.rds"))) {
  datives_variety_forests_vadis_inner <- readRDS(
    here("model_output", "dat_crf_vadis_list_inner.rds")
  )
} else {
  datives_variety_forests_vadis_inner <- lapply(
    data_dats_inner_split, function(x) fit_party_crf(dat_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
  )
  saveRDS(datives_variety_forests_vadis_inner, here("model_output", "dat_crf_vadis_list_inner.rds"))
}

Run the VADIS analyses for lines 1, 2 and 3.

dat_signif_line <- vadis_line1(datives_variety_glmer_inner,
  path = here("model_output", "dats_line1_inner.rds"),
  overwrite = "reload"
)
dat_coef_line <- vadis_line2(datives_variety_glmer_inner,
  path = here("model_output", "dats_line2_inner.rds"),
  overwrite = "reload"
)
dat_varimp_line <- vadis_line3(datives_variety_forests_vadis_inner,
  path = here("model_output", "dats_line3_inner.rds"),
  conditional = FALSE,
  overwrite = "reload"
)

Now look at the mean values by line and variety.

dat_mean_sims <- data.frame(
  line1 = dat_signif_line$similarity.scores[, 1], # take only the first column of values
  line2 = dat_coef_line$similarity.scores[, 1],
  line3 = dat_varimp_line$similarity.scores[, 1]
) %>%
  add_row(!!!colMeans(.))
dat_mean_sims$Variety_Mean <- rowMeans(dat_mean_sims)
rownames(dat_mean_sims) <- c(names(data_dats_inner_split), "Line Mean")
round(dat_mean_sims, 3)
          line1 line2 line3 Variety_Mean
BrE       0.708 0.760 0.659        0.709
CanE      0.625 0.799 0.587        0.670
IrE       0.625 0.776 0.683        0.695
NZE       0.792 0.839 0.611        0.747
Line Mean 0.688 0.793 0.635        0.705

Finally, the PARTICLE PLACEMENT alternation.

if (!file.exists(here("model_output", "pv_variety_glmer_inner.rds"))) {
  pv_variety_glmer_inner <- vector("list")
  for (i in seq_along(data_pv_inner_split)) {
    d <- data_pv_inner_split[[i]]
    # standardize the model inputs, excluding the response and random effects
    d_std <- VADIS::stand(d, cols = pv_f4) # the model formula selects the columns
    # fit the model
    pv_variety_glmer_inner[[i]] <- glmer(
      pv_f4,
      data = d_std, family = binomial, control = glmer_ctrl
    )
    rm(d, d_std) # remove datasets
  }
  names(pv_variety_glmer_inner) <- names(data_pv_inner_split)
  saveRDS(pv_variety_glmer_inner, here("model_output", "pv_variety_glmer_inner.rds"))
} else {
  pv_variety_glmer_inner <- readRDS(here("model_output", "pv_variety_glmer_inner.rds"))
}

Check models.

round(VADIS::summary_stats(pv_variety_glmer_inner), 3)
        N baseline predicted.corr Brier     C LogScore      AIC Max.VIF kappa
BrE  1324    0.709          0.825 0.120 0.885    0.376 1122.600   1.159 1.699
CanE 1392    0.693          0.821 0.120 0.894    0.370 1173.338   1.154 1.821
IrE  1347    0.734          0.828 0.117 0.886    0.364 1124.159   1.174 1.714
NZE  1543    0.686          0.842 0.114 0.902    0.362 1275.064   1.136 1.666
     HosLem.p
BrE     0.456
CanE    0.273
IrE     0.022
NZE     0.455

Now fit a random forest for each variety.

if (file.exists(here("model_output", "pv_crf_vadis_list_inner.rds"))) {
  pv_variety_forests_vadis_inner <- readRDS(
    here("model_output", "pv_crf_vadis_list_inner.rds")
  )
} else {
  pv_variety_forests_vadis_inner <- lapply(
    data_pv_inner_split, function(x) fit_party_crf(pv_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
  )
  saveRDS(pv_variety_forests_vadis_inner, here("model_output", "pv_crf_vadis_list_inner.rds"))
}

Run the VADIS analyses for lines 1, 2 and 3.

pv_signif_line <- vadis_line1(pv_variety_glmer_inner,
  path = here("model_output", "pv_line1_inner.rds"),
  overwrite = "reload"
)
pv_coef_line <- vadis_line2(pv_variety_glmer_inner,
  path = here("model_output", "pv_line2_inner.rds"),
  overwrite = "reload"
)
pv_varimp_line <- vadis_line3(pv_variety_forests_vadis_inner,
  path = here("model_output", "pv_line3_inner.rds"),
  conditional = FALSE,
  overwrite = "reload"
)

Now look at the mean values by line and variety.

pv_mean_sims <- data.frame(
  line1 = pv_signif_line$similarity.scores[, 1], # take only the first column of values
  line2 = pv_coef_line$similarity.scores[, 1],
  line3 = pv_varimp_line$similarity.scores[, 1]
) %>%
  add_row(!!!colMeans(.))
pv_mean_sims$Variety_Mean <- rowMeans(pv_mean_sims)
rownames(pv_mean_sims) <- c(names(data_pv_inner_split), "Line Mean")
round(pv_mean_sims, 3)
          line1 line2 line3 Variety_Mean
BrE       0.833 0.850 0.897        0.860
CanE      0.833 0.818 0.865        0.839
IrE       0.667 0.807 0.825        0.766
NZE       0.833 0.834 0.889        0.852
Line Mean 0.792 0.827 0.869        0.829

Create summary table.

tab <- data.frame(
  Genitives = unlist(gen_mean_sims[5, ]),
  Datives = unlist(dat_mean_sims[5, ]),
  `Particle placement` = unlist(pv_mean_sims[5, ])
) |>
  round(2)
rownames(tab) <- c("1st line", "2nd line", "3rd line", "mean")
tab


         Genitives Datives Particle.placement
1st line      0.93    0.69               0.79
2nd line      0.74    0.79               0.83
3rd line      0.95    0.63               0.87
mean          0.87    0.71               0.83

Core Grammar Score for Inner Circle data only:

round(rowMeans(tab[4, ]), 3)
 mean 
0.803 
# clean up
rm(genitives_variety_forests_vadis_inner, datives_variety_forests_vadis_inner, pv_variety_forests_vadis_inner)
invisible(gc(verbose = FALSE)) # free unused memory

Outer Circle only

Just the Outer Circle data.

data_gens_outer <- data_genitives |>
  filter(Variety %in% c("ja", "sin", "hk", "phi", "ind")) |>
  mutate(
    PorHeadLemma_pruned = filter_infrequent(POR_HEAD_LEMMA, 20),
    PumHeadLemma_pruned = filter_infrequent(PUM_HEAD_LEMMA, 20)
  ) |>
  droplevels()
data_gens_outer_split <- split(data_gens_outer, data_gens_outer$variety_of_E)

data_dats_outer <- data_datives |>
  filter(Variety %in% c("JA", "SIN", "HK", "PHI", "IND")) |>
  mutate(
    Verb_pruned = filter_infrequent(Verb, 20),
    RecHeadPlain_pruned = filter_infrequent(RecHeadPlain, 20),
    ThemeHeadPlain_pruned = filter_infrequent(ThemeHeadPlain, 20)
  ) |>
  droplevels()
data_dats_outer_split <- split(data_dats_outer, data_dats_outer$variety_of_E)

data_pv_outer <- data_particle_verbs |>
  filter(Variety %in% c("JA", "SG", "HK", "PH", "IN")) |>
  mutate(
    VerbPart_pruned = filter_infrequent(VerbPart, 20),
    Verb_pruned = filter_infrequent(Verb, 20)
  ) |>
  droplevels()
data_pv_outer_split <- split(data_pv_outer, data_pv_outer$variety_of_E)
# remove unnecessary
rm(data_gens_outer, data_dats_outer, data_pv_outer)
invisible(gc(verbose = FALSE)) # free unused memory

Start with the GENITIVE alternation. First run the by-variety regression models.

if (!file.exists(here("model_output", "genitives_variety_glmer_outer.rds"))) {
  genitives_variety_glmer_outer <- vector("list")
  for (i in seq_along(data_gens_outer_split)) {
    d <- data_gens_outer_split[[i]]
    # standardize the model inputs, excluding the response and random effects
    d_std <- VADIS::stand(d, cols = gen_f4) # the model formula selects the columns
    # fit the model
    genitives_variety_glmer_outer[[i]] <- glmer(
      gen_f4,
      data = d_std, family = binomial, control = glmer_ctrl
    )
    rm(d, d_std) # remove datasets
  }
  names(genitives_variety_glmer_outer) <- names(data_gens_outer_split)
  saveRDS(genitives_variety_glmer_outer, here("model_output", "genitives_variety_glmer_outer.rds"))
} else {
  genitives_variety_glmer_outer <- readRDS(here("model_output", "genitives_variety_glmer_outer.rds"))
}

Check models.

round(VADIS::summary_stats(genitives_variety_glmer_outer), 3)
        N baseline predicted.corr Brier     C LogScore      AIC Max.VIF kappa
HKE  1529    0.680          0.846 0.110 0.911    0.346 1145.435   1.135 2.128
IndE 1836    0.752          0.871 0.093 0.916    0.303 1208.988   1.120 1.923
JamE 1307    0.772          0.871 0.092 0.910    0.299  859.519   1.237 1.981
PhlE 1749    0.719          0.847 0.111 0.898    0.352 1315.057   1.090 1.912
SgE  1225    0.656          0.847 0.115 0.906    0.367  986.384   1.240 2.365
     HosLem.p
HKE     0.552
IndE    0.953
JamE    0.271
PhlE    0.624
SgE     0.007

Now fit a random forest for each variety.

if (file.exists(here("model_output", "gen_crf_vadis_list_outer.rds"))) {
  genitives_variety_forests_vadis_outer <- readRDS(
    here("model_output", "gen_crf_vadis_list_outer.rds")
  )
} else {
  genitives_variety_forests_vadis_outer <- lapply(
    data_gens_outer_split, function(x) fit_party_crf(gen_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
  )
  saveRDS(genitives_variety_forests_vadis_outer, here("model_output", "gen_crf_vadis_list_outer.rds"))
}

Run the VADIS analyses for lines 1, 2 and 3.

gen_signif_line <- vadis_line1(genitives_variety_glmer_outer,
  path = here("model_output", "gens_line1_outer.rds"),
  overwrite = "reload"
)
gen_coef_line <- vadis_line2(genitives_variety_glmer_outer,
  path = here("model_output", "gens_line2_outer.rds"),
  overwrite = "reload"
)
gen_varimp_line <- vadis_line3(genitives_variety_forests_vadis_outer,
  path = here("model_output", "gens_line3_outer.rds"),
  conditional = FALSE,
  overwrite = "reload"
)

Now look at the mean values by line and variety.

gen_mean_sims <- data.frame(
  line1 = gen_signif_line$similarity.scores[, 1], # take only the first column of values
  line2 = gen_coef_line$similarity.scores[, 1],
  line3 = gen_varimp_line$similarity.scores[, 1]
) %>%
  add_row(!!!colMeans(.))
gen_mean_sims$Variety_Mean <- rowMeans(gen_mean_sims)
rownames(gen_mean_sims) <- c(names(data_gens_outer_split), "Line Mean")
round(gen_mean_sims, 3)
          line1 line2 line3 Variety_Mean
HKE       0.889 0.645 0.738        0.757
IndE      0.889 0.615 0.738        0.747
JamE      0.917 0.664 0.726        0.769
PhlE      0.833 0.571 0.774        0.726
SgE       0.917 0.687 0.810        0.804
Line Mean 0.889 0.636 0.757        0.761

Next, the DATIVE alternation.

if (!file.exists(here("model_output", "datives_variety_glmer_outer.rds"))) {
  datives_variety_glmer_outer <- vector("list")
  for (i in seq_along(data_dats_outer_split)) {
    d <- data_dats_outer_split[[i]]
    # standardize the model inputs, excluding the response and random effects
    d_std <- VADIS::stand(d, cols = dat_f4) # the model formula selects the columns
    # fit the model
    datives_variety_glmer_outer[[i]] <- glmer(
      dat_f4,
      data = d_std, family = binomial, control = glmer_ctrl
    )
    rm(d, d_std) # remove datasets
  }
  names(datives_variety_glmer_outer) <- names(data_dats_outer_split)
  saveRDS(datives_variety_glmer_outer, here("model_output", "datives_variety_glmer_outer.rds"))
} else {
  datives_variety_glmer_outer <- readRDS(here("model_output", "datives_variety_glmer_outer.rds"))
}

Check models.

round(VADIS::summary_stats(datives_variety_glmer_outer), 3)
        N baseline predicted.corr Brier     C LogScore      AIC Max.VIF kappa
HKE  1768    0.639          0.890 0.081 0.952    0.265 1061.763   2.005 3.929
IndE 1550    0.588          0.895 0.076 0.954    0.263  936.315   2.056 4.241
JamE 1375    0.711          0.925 0.062 0.966    0.211  673.547   2.466 4.107
PhlE 1505    0.661          0.907 0.069 0.962    0.232  796.372   2.159 4.134
SgE  1543    0.709          0.887 0.079 0.951    0.256  885.884   2.447 4.001
     HosLem.p
HKE     0.210
IndE    0.010
JamE    0.032
PhlE    0.903
SgE     0.370

Now fit a random forest for each variety.

if (file.exists(here("model_output", "dat_crf_vadis_list_outer.rds"))) {
  datives_variety_forests_vadis_outer <- readRDS(
    here("model_output", "dat_crf_vadis_list_outer.rds")
  )
} else {
  datives_variety_forests_vadis_outer <- lapply(
    data_dats_outer_split, function(x) fit_party_crf(dat_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
  )
  saveRDS(datives_variety_forests_vadis_outer, here("model_output", "dat_crf_vadis_list_outer.rds"))
}

Run the VADIS analyses for lines 1, 2 and 3.

dat_signif_line <- vadis_line1(datives_variety_glmer_outer,
  path = here("model_output", "dats_line1_outer.rds"),
  overwrite = "reload"
)
dat_coef_line <- vadis_line2(datives_variety_glmer_outer,
  path = here("model_output", "dats_line2_outer.rds"),
  overwrite = "reload"
)
dat_varimp_line <- vadis_line3(datives_variety_forests_vadis_outer,
  path = here("model_output", "dats_line3_outer.rds"),
  conditional = FALSE,
  overwrite = "reload"
)

Now look at the mean values by line and variety.

dat_mean_sims <- data.frame(
  line1 = dat_signif_line$similarity.scores[, 1], # take only the first column of values
  line2 = dat_coef_line$similarity.scores[, 1],
  line3 = dat_varimp_line$similarity.scores[, 1]
) %>%
  add_row(!!!colMeans(.))
dat_mean_sims$Variety_Mean <- rowMeans(dat_mean_sims)
rownames(dat_mean_sims) <- c(names(data_dats_outer_split), "Line Mean")
round(dat_mean_sims, 3)
          line1 line2 line3 Variety_Mean
HKE       0.688 0.727 0.810        0.741
IndE      0.781 0.680 0.673        0.711
JamE      0.656 0.562 0.810        0.676
PhlE      0.656 0.684 0.661        0.667
SgE       0.719 0.693 0.810        0.740
Line Mean 0.700 0.669 0.752        0.707

Finally, the PARTICLE PLACEMENT alternation.

if (!file.exists(here("model_output", "pv_variety_glmer_outer.rds"))) {
  pv_variety_glmer_outer <- vector("list")
  for (i in seq_along(data_pv_outer_split)) {
    d <- data_pv_outer_split[[i]]
    # standardize the model inputs, excluding the response and random effects
    d_std <- VADIS::stand(d, cols = pv_f4) # the model formula selects the columns
    # fit the model
    pv_variety_glmer_outer[[i]] <- glmer(
      pv_f4,
      data = d_std, family = binomial, control = glmer_ctrl
    )
    rm(d, d_std) # remove datasets
  }
  names(pv_variety_glmer_outer) <- names(data_pv_outer_split)
  saveRDS(pv_variety_glmer_outer, here("model_output", "pv_variety_glmer_outer.rds"))
} else {
  pv_variety_glmer_outer <- readRDS(here("model_output", "pv_variety_glmer_outer.rds"))
}

Check models.

round(VADIS::summary_stats(pv_variety_glmer_outer), 3)
        N baseline predicted.corr Brier     C LogScore     AIC Max.VIF kappa
HKE  1208    0.851          0.872 0.087 0.882    0.281 758.224   1.186 1.767
IndE 1083    0.902          0.921 0.064 0.879    0.219 542.674   1.269 1.928
JamE 1106    0.854          0.879 0.088 0.868    0.288 713.929   1.246 1.726
PhlE 1004    0.865          0.885 0.083 0.882    0.266 596.010   1.215 1.686
SgE  1333    0.848          0.884 0.085 0.894    0.275 832.543   1.202 1.689
     HosLem.p
HKE     0.733
IndE    0.982
JamE    0.760
PhlE    0.665
SgE     0.042

Now fit a random forest for each variety.

if (file.exists(here("model_output", "pv_variety_forests_vadis_outer.rds"))) {
  pv_variety_forests_vadis_outer <- readRDS(
    here("model_output", "pv_variety_forests_vadis_outer.rds")
  )
} else {
  pv_variety_forests_vadis_outer <- lapply(
    data_pv_outer_split,
    function(x) fit_party_crf(pv_f1, x, controls = cforest_unbiased(ntree = 500, mtry = 3))
  )
  saveRDS(pv_variety_forests_vadis_outer, here("model_output", "pv_variety_forests_vadis_outer.rds"))
}

Run the VADIS analyses for lines 1, 2 and 3.

pv_signif_line <- vadis_line1(pv_variety_glmer_outer,
  path = here("model_output", "pv_line1_outer.rds"),
  overwrite = "reload"
)
pv_coef_line <- vadis_line2(pv_variety_glmer_outer,
  path = here("model_output", "pv_line2_outer.rds"),
  overwrite = "reload"
)
pv_varimp_line <- vadis_line3(pv_variety_forests_vadis_outer,
  path = here("model_output", "pv_line3_outer.rds"),
  conditional = FALSE,
  overwrite = "reload"
)

Now look at the mean values by line and variety.

pv_mean_sims <- data.frame(
  line1 = pv_signif_line$similarity.scores[, 1], # get only the values in the 1st column
  line2 = pv_coef_line$similarity.scores[, 1],
  line3 = pv_varimp_line$similarity.scores[, 1]
) %>%
  add_row(!!!colMeans(.))
pv_mean_sims$Variety_Mean <- rowMeans(pv_mean_sims)
rownames(pv_mean_sims) <- c(names(data_pv_outer_split), "Line Mean")
round(pv_mean_sims, 3)
          line1 line2 line3 Variety_Mean
HKE       0.750 0.788 0.679        0.739
IndE      0.531 0.703 0.601        0.612
JamE      0.750 0.777 0.625        0.717
PhlE      0.781 0.772 0.601        0.718
SgE       0.688 0.793 0.744        0.741
Line Mean 0.700 0.766 0.650        0.705

Create the summary table of mean similarity scores across the three alternations.

tab <- data.frame(
  Genitives = unlist(gen_mean_sims[6, ]),
  Datives = unlist(dat_mean_sims[6, ]),
  `Particle placement` = unlist(pv_mean_sims[6, ])
) |>
  round(2)
rownames(tab) <- c("1st line", "2nd line", "3rd line", "mean")
tab


         Genitives Datives Particle.placement
1st line      0.89    0.70               0.70
2nd line      0.64    0.67               0.77
3rd line      0.76    0.75               0.65
mean          0.76    0.71               0.71

Core Grammar Score for Outer Circle data only:

round(rowMeans(tab[4, ]), 3)
 mean 
0.727 
# clean up
rm(genitives_variety_forests_vadis_outer, datives_variety_forests_vadis_outer, pv_variety_forests_vadis_outer)
invisible(gc(verbose = FALSE)) # free unused memory

6.5 Evaluating Coherence

By-line pairwise correlations

gen_line12 <- vegan::mantel(gen_signif_line$distance.matrix, gen_coef_line$distance.matrix)
gen_line13 <- vegan::mantel(gen_signif_line$distance.matrix, gen_varimp_line$distance.matrix)
gen_line23 <- vegan::mantel(gen_coef_line$distance.matrix, gen_varimp_line$distance.matrix)

dat_line12 <- vegan::mantel(dat_signif_line$distance.matrix, dat_coef_line$distance.matrix)
dat_line13 <- vegan::mantel(dat_signif_line$distance.matrix, dat_varimp_line$distance.matrix)
dat_line23 <- vegan::mantel(dat_coef_line$distance.matrix, dat_varimp_line$distance.matrix)

pv_line12 <- vegan::mantel(pv_signif_line$distance.matrix, pv_coef_line$distance.matrix)
pv_line13 <- vegan::mantel(pv_signif_line$distance.matrix, pv_varimp_line$distance.matrix)
pv_line23 <- vegan::mantel(pv_coef_line$distance.matrix, pv_varimp_line$distance.matrix)
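The statistics and significance values from these tests can be gathered into a single table. A sketch, assuming the objects created above; `vegan::mantel()` stores the correlation in the `statistic` component and the permutation p-value in `signif`:

```r
# Tabulate a named list of vegan::mantel() results
mantel_table <- function(tests) {
  data.frame(
    comparison = names(tests),
    r = round(sapply(tests, function(m) m$statistic), 3),
    p = round(sapply(tests, function(m) m$signif), 3),
    row.names = NULL
  )
}
# e.g. mantel_table(list(
#   `gen 1v2` = gen_line12, `gen 1v3` = gen_line13, `gen 2v3` = gen_line23
# ))
```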

7.2 Experimental results

Below are plots of residuals and random effects adjustments for the model of experimental ratings.

Residuals

Check normality of residuals.

data.frame(resids = residuals(rating_mod0)) |>
  ggplot(aes(sample = resids)) +
  geom_qq() +
  geom_qq_line() +
  labs(x = "theoretical quantiles", y = "sample quantiles") +
  ggtitle("QQ plot of model residuals") +
  theme_cup()

Figure 6.1: QQ plot of residuals for the model of experimental ratings.

Random effects

We can check the BLUPs for the random effects to make sure they are normal.

Code
item_df <- lme4::ranef(rating_mod1)[["item"]] |>
  as.data.frame() |>
  tidyr::pivot_longer(cols = everything()) |>
  mutate(
    group = "item",
    name = paste0("item|", name)
  )

id_df <- lme4::ranef(rating_mod1)[["id"]] |>
  as.data.frame() |>
  tidyr::pivot_longer(cols = everything()) |>
  mutate(
    group = "id",
    name = paste0("id|", name),
    name = str_replace(name, "z.DirObjWordLength", "DirObjWordLength"),
    name = str_replace(name, "z.pred_min", "CorpusPred")
  )

full_df <- bind_rows(
  id_df,
  item_df
)

nested_df <- full_df |>
  group_by(name) |>
  nest()

p <- map(
  seq_along(nested_df$data),
  .f = function(i) {
    d <- nested_df$data[[i]]
    p <- d |>
      ggplot(aes(sample = value)) +
      geom_qq() +
      geom_qq_line() +
      labs(
        x = "",
        y = ""
      ) +
      theme_cup() +
      ggtitle(pull(nested_df, name)[i])
    return(p)
  }
) |>
  wrap_plots(ncol = 2)

ylab <- data.frame(
  label = "Theoretical",
  x = 1,
  y = 1
) |>
  ggplot() +
  geom_text(aes(x, y, label = label), angle = 90) +
  theme_void() +
  coord_cartesian(clip = "off")


p <- ylab + p + plot_layout(widths = c(1, 25)) +
  plot_annotation(
    caption = "Observed",
    theme = theme(plot.caption = element_text(
      hjust = 0.5, size = 12,
      margin = margin(t = 0),
      family = "serif"
    ))
  )
p

Figure 6.2: Distribution of BLUPs for model of experimental ratings.

Session Info

Most recent session info for the analysis.

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16 ucrt)
 os       Windows 10 x64 (build 19045)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United Kingdom.utf8
 ctype    English_United Kingdom.utf8
 tz       Europe/London
 date     2023-07-14
 pandoc   3.1.4 @ C:/Users/grafmilj/AppData/Local/Pandoc/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package      * version    date (UTC) lib source
 abind          1.4-5      2016-07-21 [1] CRAN (R 4.3.0)
 analogue     * 0.17-6     2021-06-20 [1] CRAN (R 4.3.1)
 ape          * 5.7-1      2023-03-13 [1] CRAN (R 4.3.1)
 backports      1.4.1      2021-12-13 [1] CRAN (R 4.3.0)
 base64enc      0.1-3      2015-07-28 [1] CRAN (R 4.3.0)
 bayestestR     0.13.1     2023-04-07 [1] CRAN (R 4.3.1)
 boot           1.3-28.1   2022-11-22 [1] CRAN (R 4.3.1)
 brglm          0.7.2      2021-04-22 [1] CRAN (R 4.3.1)
 cachem         1.0.8      2023-05-01 [1] CRAN (R 4.3.1)
 callr          3.7.3      2022-11-02 [1] CRAN (R 4.3.1)
 car          * 3.1-2      2023-03-30 [1] CRAN (R 4.3.1)
 carData      * 3.0-5      2022-01-06 [1] CRAN (R 4.3.1)
 caret        * 6.0-94     2023-03-21 [1] CRAN (R 4.3.1)
 checkmate      2.2.0      2023-04-27 [1] CRAN (R 4.3.1)
 class          7.3-22     2023-05-03 [1] CRAN (R 4.3.1)
 cli            3.6.1      2023-03-23 [1] CRAN (R 4.3.1)
 cluster        2.1.4      2022-08-22 [2] CRAN (R 4.3.1)
 coda           0.19-4     2020-09-30 [1] CRAN (R 4.3.1)
 codetools      0.2-19     2023-02-01 [1] CRAN (R 4.3.0)
 coin           1.4-2      2021-10-08 [1] CRAN (R 4.3.1)
 colorspace     2.1-0      2023-01-23 [1] CRAN (R 4.3.1)
 crayon         1.5.2      2022-09-29 [1] CRAN (R 4.3.1)
 data.table     1.14.8     2023-02-17 [1] CRAN (R 4.3.1)
 datawizard     0.8.0      2023-06-16 [1] CRAN (R 4.3.1)
 DBI            1.1.3      2022-06-18 [1] CRAN (R 4.3.1)
 devtools       2.4.5      2022-10-11 [1] CRAN (R 4.3.1)
 digest         0.6.32     2023-06-26 [1] CRAN (R 4.3.1)
 dplyr        * 1.1.2      2023-04-20 [1] CRAN (R 4.3.1)
 e1071          1.7-13     2023-02-01 [1] CRAN (R 4.3.1)
 effects        4.2-2      2022-07-13 [1] CRAN (R 4.3.1)
 ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.3.1)
 emmeans        1.8.7      2023-06-23 [1] CRAN (R 4.3.1)
 estimability   1.4.1      2022-08-05 [1] CRAN (R 4.3.0)
 evaluate       0.21       2023-05-05 [1] CRAN (R 4.3.1)
 extrafont    * 0.19       2023-01-18 [1] CRAN (R 4.3.0)
 extrafontdb    1.0        2012-06-11 [1] CRAN (R 4.3.0)
 fansi          1.0.4      2023-01-22 [1] CRAN (R 4.3.1)
 farver         2.1.1      2022-07-06 [1] CRAN (R 4.3.1)
 fastmap        1.1.1      2023-02-24 [1] CRAN (R 4.3.1)
 fastmatch      1.1-3      2021-07-23 [1] CRAN (R 4.3.0)
 forcats      * 1.0.0      2023-01-29 [1] CRAN (R 4.3.1)
 foreach        1.5.2      2022-02-02 [1] CRAN (R 4.3.1)
 foreign        0.8-84     2022-12-06 [1] CRAN (R 4.3.0)
 Formula        1.2-5      2023-02-24 [1] CRAN (R 4.3.0)
 fs             1.6.2      2023-04-25 [1] CRAN (R 4.3.1)
 future         1.33.0     2023-07-01 [1] CRAN (R 4.3.1)
 future.apply   1.11.0     2023-05-21 [1] CRAN (R 4.3.1)
 gdata          2.19.0     2023-05-05 [1] CRAN (R 4.3.1)
 generics       0.1.3      2022-07-05 [1] CRAN (R 4.3.1)
 ggdendro     * 0.1.23     2022-02-16 [1] CRAN (R 4.3.1)
 ggeffects    * 1.2.3      2023-06-11 [1] CRAN (R 4.3.1)
 ggplot2      * 3.4.2      2023-04-03 [1] CRAN (R 4.3.1)
 ggrepel      * 0.9.3      2023-02-03 [1] CRAN (R 4.3.1)
 globals        0.16.2     2022-11-21 [1] CRAN (R 4.3.0)
 glue           1.6.2      2022-02-24 [1] CRAN (R 4.3.1)
 gmodels      * 2.18.1.1   2022-05-17 [1] CRAN (R 4.3.1)
 gower          1.0.1      2022-12-22 [1] CRAN (R 4.3.0)
 gridExtra      2.3        2017-09-09 [1] CRAN (R 4.3.1)
 gtable         0.3.3      2023-03-21 [1] CRAN (R 4.3.1)
 gtools         3.9.4      2022-11-27 [1] CRAN (R 4.3.1)
 hardhat        1.3.0      2023-03-30 [1] CRAN (R 4.3.1)
 haven          2.5.3      2023-06-30 [1] CRAN (R 4.3.1)
 here         * 1.0.1      2020-12-13 [1] CRAN (R 4.3.1)
 highr          0.10       2022-12-22 [1] CRAN (R 4.3.1)
 Hmisc        * 5.1-0      2023-05-08 [1] CRAN (R 4.3.1)
 hms            1.1.3      2023-03-21 [1] CRAN (R 4.3.1)
 htmlTable      2.4.1      2022-07-07 [1] CRAN (R 4.3.1)
 htmltools      0.5.5      2023-03-23 [1] CRAN (R 4.3.1)
 htmlwidgets    1.6.2      2023-03-17 [1] CRAN (R 4.3.1)
 httpuv         1.6.11     2023-05-11 [1] CRAN (R 4.3.1)
 httr           1.4.6      2023-05-08 [1] CRAN (R 4.3.1)
 hypr         * 0.2.3      2022-08-08 [1] CRAN (R 4.3.1)
 igraph         1.5.0      2023-06-16 [1] CRAN (R 4.3.1)
 insight        0.19.3     2023-06-29 [1] CRAN (R 4.3.1)
 ipred          0.9-14     2023-03-09 [1] CRAN (R 4.3.1)
 iterators      1.0.14     2022-02-05 [1] CRAN (R 4.3.1)
 jsonlite       1.8.7      2023-06-29 [1] CRAN (R 4.3.1)
 kableExtra   * 1.3.4      2021-02-20 [1] CRAN (R 4.3.1)
 knitr        * 1.43       2023-05-25 [1] CRAN (R 4.3.1)
 labeling       0.4.2      2020-10-20 [1] CRAN (R 4.3.0)
 later          1.3.1      2023-05-02 [1] CRAN (R 4.3.1)
 lattice      * 0.21-8     2023-04-05 [1] CRAN (R 4.2.3)
 lava           1.7.2.1    2023-02-27 [1] CRAN (R 4.3.1)
 libcoin        1.0-9      2021-09-27 [1] CRAN (R 4.3.1)
 lifecycle      1.0.3      2022-10-07 [1] CRAN (R 4.3.1)
 listenv        0.9.0      2022-12-16 [1] CRAN (R 4.3.1)
 lme4         * 1.1-34     2023-07-04 [1] CRAN (R 4.3.1)
 lmerTest     * 3.1-3      2020-10-23 [1] CRAN (R 4.3.1)
 lubridate    * 1.9.2      2023-02-10 [1] CRAN (R 4.3.1)
 magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.3.1)
 MASS           7.3-60     2023-05-04 [1] CRAN (R 4.3.1)
 Matrix       * 1.5-4.1    2023-05-18 [1] CRAN (R 4.3.1)
 matrixStats    1.0.0      2023-06-02 [1] CRAN (R 4.3.1)
 memoise        2.0.1      2021-11-26 [1] CRAN (R 4.3.1)
 mgcv           1.8-42     2023-03-02 [1] CRAN (R 4.3.1)
 mime           0.12       2021-09-28 [1] CRAN (R 4.3.0)
 miniUI         0.1.1.1    2018-05-18 [1] CRAN (R 4.3.1)
 minqa          1.2.5      2022-10-19 [1] CRAN (R 4.3.1)
 mitools        2.4        2019-04-26 [1] CRAN (R 4.3.1)
 ModelMetrics   1.2.2.2    2020-03-17 [1] CRAN (R 4.3.1)
 modeltools   * 0.2-23     2020-03-05 [1] CRAN (R 4.3.0)
 multcomp       1.4-25     2023-06-20 [1] CRAN (R 4.3.1)
 MuMIn        * 1.47.5     2023-03-22 [1] CRAN (R 4.3.1)
 munsell        0.5.0      2018-06-12 [1] CRAN (R 4.3.1)
 mvtnorm      * 1.2-2      2023-06-08 [1] CRAN (R 4.3.1)
 nlme           3.1-162    2023-01-31 [1] CRAN (R 4.3.1)
 nloptr         2.0.3      2022-05-26 [1] CRAN (R 4.3.1)
 nnet           7.3-19     2023-05-03 [1] CRAN (R 4.3.1)
 numDeriv       2016.8-1.1 2019-06-06 [1] CRAN (R 4.3.0)
 optimx       * 2022-4.30  2022-05-10 [1] CRAN (R 4.3.1)
 parallelly     1.36.0     2023-05-26 [1] CRAN (R 4.3.0)
 parameters   * 0.21.1     2023-05-26 [1] CRAN (R 4.3.1)
 party        * 1.3-13     2023-03-17 [1] CRAN (R 4.3.1)
 patchwork    * 1.1.2      2022-08-19 [1] CRAN (R 4.3.1)
 performance  * 0.10.4     2023-06-02 [1] CRAN (R 4.3.1)
 permimp      * 1.0-2      2021-09-13 [1] CRAN (R 4.3.1)
 permute      * 0.9-7      2022-01-27 [1] CRAN (R 4.3.1)
 phangorn     * 2.11.1     2023-01-23 [1] CRAN (R 4.3.1)
 pillar         1.9.0      2023-03-22 [1] CRAN (R 4.3.1)
 pkgbuild       1.4.2      2023-06-26 [1] CRAN (R 4.3.1)
 pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.3.1)
 pkgload        1.3.2      2022-11-16 [1] CRAN (R 4.3.1)
 plyr           1.8.8      2022-11-11 [1] CRAN (R 4.3.1)
 png            0.1-8      2022-11-29 [1] CRAN (R 4.3.0)
 pracma         2.4.2      2022-09-22 [1] CRAN (R 4.3.1)
 prettyunits    1.1.1      2020-01-24 [1] CRAN (R 4.3.1)
 princurve      2.1.6      2021-01-18 [1] CRAN (R 4.3.1)
 pROC           1.18.4     2023-07-06 [1] CRAN (R 4.3.1)
 processx       3.8.2      2023-06-30 [1] CRAN (R 4.3.1)
 prodlim        2023.03.31 2023-04-02 [1] CRAN (R 4.3.1)
 profileModel   0.6.1      2021-01-08 [1] CRAN (R 4.3.1)
 profvis        0.3.8      2023-05-02 [1] CRAN (R 4.3.1)
 promises       1.2.0.1    2021-02-11 [1] CRAN (R 4.3.1)
 proxy          0.4-27     2022-06-09 [1] CRAN (R 4.3.1)
 ps             1.7.5      2023-04-18 [1] CRAN (R 4.3.1)
 purrr        * 1.0.1      2023-01-10 [1] CRAN (R 4.3.1)
 quadprog       1.5-8      2019-11-20 [1] CRAN (R 4.3.0)
 R.cache        0.16.0     2022-07-21 [1] CRAN (R 4.3.1)
 R.methodsS3    1.8.2      2022-06-13 [1] CRAN (R 4.3.0)
 R.oo           1.25.0     2022-06-12 [1] CRAN (R 4.3.0)
 R.utils        2.12.2     2022-11-11 [1] CRAN (R 4.3.1)
 R6             2.5.1      2021-08-19 [1] CRAN (R 4.3.1)
 randomForest   4.7-1.1    2022-05-23 [1] CRAN (R 4.3.1)
 ranger         0.15.1     2023-04-03 [1] CRAN (R 4.3.1)
 RColorBrewer   1.1-3      2022-04-03 [1] CRAN (R 4.3.0)
 Rcpp           1.0.11     2023-07-06 [1] CRAN (R 4.3.1)
 readr        * 2.1.4      2023-02-10 [1] CRAN (R 4.3.1)
 recipes        1.0.6      2023-04-25 [1] CRAN (R 4.3.1)
 remotes      * 2.4.2      2021-11-30 [1] CRAN (R 4.3.1)
 reshape2       1.4.4      2020-04-09 [1] CRAN (R 4.3.1)
 rlang          1.1.1      2023-04-28 [1] CRAN (R 4.3.1)
 rmarkdown      2.23       2023-07-01 [1] CRAN (R 4.3.1)
 rpart          4.1.19     2022-10-21 [2] CRAN (R 4.3.1)
 rprojroot      2.0.3      2022-04-02 [1] CRAN (R 4.3.1)
 rstudioapi     0.14       2022-08-22 [1] CRAN (R 4.3.1)
 Rttf2pt1       1.3.12     2023-01-22 [1] CRAN (R 4.3.0)
 rvest          1.0.3      2022-08-19 [1] CRAN (R 4.3.1)
 sandwich     * 3.0-2      2022-06-15 [1] CRAN (R 4.3.1)
 scales       * 1.2.1      2022-08-20 [1] CRAN (R 4.3.1)
 sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.3.1)
 shiny          1.7.4.1    2023-07-06 [1] CRAN (R 4.3.1)
 sjlabelled     1.2.0      2022-04-10 [1] CRAN (R 4.3.1)
 snakecase      0.11.0     2019-05-25 [1] CRAN (R 4.3.1)
 stringi        1.7.12     2023-01-11 [1] CRAN (R 4.3.0)
 stringr      * 1.5.0      2022-12-02 [1] CRAN (R 4.3.1)
 strucchange  * 1.5-3      2022-06-15 [1] CRAN (R 4.3.1)
 styler         1.10.1     2023-06-05 [1] CRAN (R 4.3.1)
 survey         4.2-1      2023-05-03 [1] CRAN (R 4.3.1)
 survival       3.5-5      2023-03-12 [1] CRAN (R 4.3.1)
 svglite        2.1.1      2023-01-10 [1] CRAN (R 4.3.1)
 systemfonts    1.0.4      2022-02-11 [1] CRAN (R 4.3.1)
 TH.data        1.1-2      2023-04-17 [1] CRAN (R 4.3.1)
 tibble       * 3.2.1      2023-03-20 [1] CRAN (R 4.3.1)
 tidyr        * 1.3.0      2023-01-24 [1] CRAN (R 4.3.1)
 tidyselect     1.2.0      2022-10-10 [1] CRAN (R 4.3.1)
 tidyverse    * 2.0.0      2023-02-22 [1] CRAN (R 4.3.1)
 timechange     0.2.0      2023-01-11 [1] CRAN (R 4.3.1)
 timeDate       4022.108   2023-01-07 [1] CRAN (R 4.3.0)
 tzdb           0.4.0      2023-05-12 [1] CRAN (R 4.3.1)
 urlchecker     1.0.1      2021-11-30 [1] CRAN (R 4.3.1)
 usethis        2.2.2      2023-07-06 [1] CRAN (R 4.3.1)
 utf8           1.2.3      2023-01-31 [1] CRAN (R 4.3.1)
 VADIS        * 1.0.1      2023-07-14 [1] local
 vctrs          0.6.3      2023-06-14 [1] CRAN (R 4.3.1)
 vegan        * 2.6-4      2022-10-11 [1] CRAN (R 4.3.1)
 viridisLite    0.4.2      2023-05-02 [1] CRAN (R 4.3.1)
 webshot        0.5.5      2023-06-26 [1] CRAN (R 4.3.1)
 withr          2.5.0      2022-03-03 [1] CRAN (R 4.3.1)
 xfun           0.39       2023-04-20 [1] CRAN (R 4.3.1)
 xml2           1.3.5      2023-07-06 [1] CRAN (R 4.3.1)
 xtable         1.8-4      2019-04-21 [1] CRAN (R 4.3.1)
 yaml           2.3.7      2023-01-23 [1] CRAN (R 4.3.0)
 zoo          * 1.8-12     2023-04-13 [1] CRAN (R 4.3.1)

 [1] C:/Users/grafmilj/Documents/R-4.0/libraries
 [2] C:/Program Files/R/R-4.3.1/library

──────────────────────────────────────────────────────────────────────────────

References

Bryant, David & Vincent Moulton. 2004. Neighbor-Net: An Agglomerative Method for the Construction of Phylogenetic Networks. Molecular Biology and Evolution 21(2). 255–265. doi:bghtr3.
Gelman, Andrew. 2008. Scaling regression inputs by dividing by two standard deviations. Statistics in Medicine 27(15). 2865–2873. doi:dbztxr.
Harrell, Frank E. 2001. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, NY: Springer New York.
Szmrecsanyi, Benedikt, Jason Grafmiller, Benedikt Heller & Melanie Röthlisberger. 2016. Around the world in three alternations: Modeling syntactic variation in varieties of English. English World-Wide 37(2). 109–137. doi:gf7ntq.